> The problem was that I was accessing a device pointer on the host.
> 
> Maybe the fact that valgrind did not print a source code line (it was in host 
> code) is a hint that you are accessing a device pointer?
> 
> ==77820== Invalid read of size 4
> ==77820==    at 0x7E69068: LandauKokkosJacobian (in 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
> ==77820==    by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)

When in doubt, use cuda-memcheck whenever you debug on GPUs. It's the CUDA 
version of valgrind and I cannot recommend it enough. Not directly related, 
but it also comes with a suite of other useful GPU-related tools that catch 
race conditions, uninitialized memory accesses, and deadlocks.

https://docs.nvidia.com/cuda/cuda-memcheck/index.html
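
For this particular class of bug (host code reading through a device pointer), one defensive option is to carry the memory space in the pointer's type, so that indexing device data from host code fails to compile instead of producing a SEGV. What follows is only a minimal sketch under stated assumptions: SpacePtr, HostSpace, DeviceSpace, host_indexable, copy_to_host and sum_on_host are hypothetical names, not PETSc or Kokkos API, and the "device" buffer is simulated with ordinary host memory so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical tags naming where a buffer lives (not PETSc or Kokkos types).
struct HostSpace {};
struct DeviceSpace {};

// Pointer wrapper that carries its memory space in the type.
template <typename T, typename Space>
class SpacePtr {
public:
  SpacePtr(T *p, std::size_t n) : p_(p), n_(n) {}
  std::size_t size() const { return n_; }
  T *raw() const { return p_; }  // only for kernels and copy routines

  // Indexing is enabled for HostSpace only, so host code that indexes a
  // DeviceSpace pointer fails to compile instead of crashing at run time.
  template <typename S = Space,
            typename = typename std::enable_if<
                std::is_same<S, HostSpace>::value>::type>
  T &operator[](std::size_t i) const {
    assert(i < n_);  // simple bounds check in debug builds
    return p_[i];
  }

private:
  T *p_;
  std::size_t n_;
};

// Compile-time probe: can this SpacePtr type be indexed from host code?
template <typename P, typename = void>
struct host_indexable : std::false_type {};
template <typename P>
struct host_indexable<
    P, decltype(void(std::declval<const P &>()[std::size_t(0)]))>
    : std::true_type {};

// Stand-in for a device-to-host transfer. Real CUDA code would call
// cudaMemcpy with cudaMemcpyDeviceToHost; here the "device" memory is
// simulated with host memory so the sketch runs anywhere.
template <typename T>
std::vector<T> copy_to_host(const SpacePtr<T, DeviceSpace> &d) {
  std::vector<T> h(d.size());
  std::memcpy(h.data(), d.raw(), d.size() * sizeof(T));
  return h;
}

// Sum the entries the safe way: mirror to the host first, then read.
inline int sum_on_host(const SpacePtr<int, DeviceSpace> &d) {
  int s = 0;
  for (int v : copy_to_host(d)) s += v;
  return s;
}
```

With this in place, writing d[0] on a SpacePtr<int, DeviceSpace> in host code is rejected by the compiler, while sum_on_host(d) goes through an explicit copy first. It does not replace cuda-memcheck, which still finds errors inside the kernels themselves.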

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On May 30, 2021, at 09:06, Mark Adams <mfad...@lbl.gov> wrote:
> 
> The problem was that I was accessing a device pointer on the host.
> 
> Maybe the fact that valgrind did not print a source code line (it was in host 
> code) is a hint that you are accessing a device pointer?
> 
> ==77820== Invalid read of size 4
> ==77820==    at 0x7E69068: LandauKokkosJacobian (in 
> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
> ==77820==    by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
> 
> This access is in landau.kokkos.cxx, but there is no source line number.
> 
> Thanks,
> 
> 
> On Sun, May 30, 2021 at 12:48 AM Mark Adams <mfad...@lbl.gov> wrote:
> 
> 
> On Sun, May 30, 2021 at 12:08 AM Barry Smith <bsm...@petsc.dev> wrote:
> 
>    Try without Valgrind, put a CHKMEMQ; just before the call to 
> LandauKokkosJacobian and as its first line. And run with -malloc_debug. This 
> is a less optimal way to find memory corruption but may be more useful in 
> this case.
> 
> I don't seem to get anything with this, but I now see that the segv is on the 
> 2nd call to LandauKokkosJacobian, which adds the mass matrix, with shift. I 
> am working on the mass matrix part now. Let me try adding print statements in 
> LandauKokkosJacobian. (DDT failed to trace into that method, but let's see).
> 
> Thanks,
> 
>       CHKMEMQ;
>       PetscPrintf(PETSC_COMM_SELF,"call LandauKokkosJacobian\n");
>       ierr = 
> LandauKokkosJacobian(ctx->plex,Nq,Eq_m,IPf,N,xdata,ctx->SData_d,ctx->subThreadBlockSize,shift,ctx->events,JacP);CHKERRQ(ierr);
> 
> 00:37 adams/landau-mass-opt *= 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/tutorials$ 
> make PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 -f mymake tiny 
> EXTRA='-dm_mat_type aijkokkos -dm_vec_type kokkos -malloc_debug' DEVICE=kokkos
> jsrun -n 1 -c 1 -g 1 ./ex2 -dim 2 -ex2_test_type none -dm_landau_Ez 0 
> -petscspace_degree 3 -dm_preallocate_only -dm_landau_type p4est 
> -dm_landau_ion_masses 1 -dm_landau_ion_charges 1 -dm_landau_thermal_temps 4,4 
> -dm_landau_n 1,1 -ts_monitorx -snes_rtol 1.e-14 -snes_stol 1.e-14 
> -snes_monitor -snes_converged_reason -snes_max_it 14 -ts_type beuler 
> -ts_exact_final_time stepover -ts_max_snes_failures 1 -ts_rtol 5e-1 -ts_dt .5 
> -ts_max_steps 1 -pc_type lu -ksp_type preonly -dm_landau_amr_levels_max 13 
> -dm_landau_device_type kokkos -dm_mat_type aijkokkos -dm_vec_type kokkos 
> -malloc_debug
> 
> 
> [0]FormLandau: 1280 IPs, 80 cells, totDim=32, Nb=16, Nq=16, elemMatSize=1024, 
> dim=2, Tab: Nb=16 Nf=2 Np=16 cdim=2 N=1406 shift=0.
> call LandauKokkosJacobian
>     0 SNES Function norm 4.974994975313e-03
> call LandauKokkosJacobian
> [0]PETSC ERROR: 
> ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, 
> probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see 
> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X 
> to find memory corruption errors
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: ---------------------  Stack Frames 
> ------------------------------------
> [0]PETSC ERROR: The EXACT line numbers in the error traceback are not 
> available.
> [0]PETSC ERROR: instead the line number of the start of the function is given.
> [0]PETSC ERROR: #1 LandauKokkosJacobian() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272
> [0]PETSC ERROR: #2 LandauFormJacobian_Internal() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/plexland.c:66
> [0]PETSC ERROR: #3 LandauIJacobian() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/plexland.c:2093
> [0]PETSC ERROR: #4 TS user implicit Jacobian() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:933
> [0]PETSC ERROR: #5 TSComputeIJacobian() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:916
> [0]PETSC ERROR: #6 SNESTSFormJacobian_Theta() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/impls/implicit/theta/theta.c:1000
> [0]PETSC ERROR: #7 SNESTSFormJacobian() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:4407
> [0]PETSC ERROR: #8 SNES user Jacobian function() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:2823
> [0]PETSC ERROR: #9 SNESComputeJacobian() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:2782
> [0]PETSC ERROR: #10 SNESSolve() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4653
> [0]PETSC ERROR: #11 TSTheta_SNESSolve() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/impls/implicit/theta/theta.c:184
> [0]PETSC ERROR: #12 TSStep_Theta() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/impls/implicit/theta/theta.c:200
> [0]PETSC ERROR: #13 TSStep() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:3548
> [0]PETSC ERROR: --------------------- Error Message 
> --------------------------------------------------------------
>  
> 
>> On May 29, 2021, at 10:46 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>> 
>> try gcc/6.4.0
>> --Junchao Zhang
>> 
>> 
>> On Sat, May 29, 2021 at 9:50 PM Mark Adams <mfad...@lbl.gov> wrote:
>> And I get grief using gcc-8.1.1 and get this error:
>> 
>> /autofs/nccs-svm1_sw/summit/gcc/8.1.1/include/c++/8.1.1/type_traits(347): 
>> error: identifier "__ieee128" is undefined
>> 
>> Any ideas?
>> 
>> On Sat, May 29, 2021 at 10:39 PM Mark Adams <mfad...@lbl.gov> wrote:
>> And valgrind sees this. I think the jump to the function is going to the 
>> wrong place. 
>> I'm giving up on PGI but can try newer versions of GCC. (what is the deal 
>> with the range of major releases, 4-10?)
>> (as I said this looks like an error that a user is getting so I'd like to 
>> figure it out).
>> 
>>     0 SNES Function norm 4.974994975313e-03
>> ==77820== Invalid read of size 4
>> ==77820==    at 0x7E69068: LandauKokkosJacobian (in 
>> /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
>> ==77820==    by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
>> ==77820==    by 0x7C728D3: LandauIJacobian (plexland.c:2107)
>> ==77820==    by 0x7C8C26B: TSComputeIJacobian (ts.c:934)
>> ==77820==    by 0x7E28337: SNESTSFormJacobian_Theta (theta.c:1007)
>> ==77820==    by 0x7CBBFD3: SNESTSFormJacobian (ts.c:4415)
>> ==77820==    by 0x7AD84BF: SNESComputeJacobian (snes.c:2824)
>> ==77820==    by 0x7BA945B: SNESSolve_NEWTONLS (ls.c:222)
>> ==77820==    by 0x7AF336F: SNESSolve (snes.c:4769)
>> ==77820==    by 0x7E19D13: TSTheta_SNESSolve (theta.c:185)
>> ==77820==    by 0x7E1A8B7: TSStep_Theta (theta.c:223)
>> ==77820==    by 0x7CB093F: TSStep (ts.c:3571)
>> ==77820==  Address 0x96fff690 is in a --- anonymous segment
>> ==77820==
>> [0]PETSC ERROR: 
>> ------------------------------------------------------------------------
>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, 
>> probably memory access out of range
>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [0]PETSC ERROR: or see 
>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X 
>> to find memory corruption errors
>> [0]PETSC ERROR: likely location of problem given in stack below
>> [0]PETSC ERROR: ---------------------  Stack Frames 
>> ------------------------------------
>> [0]PETSC ERROR: The EXACT line numbers in the error traceback are not 
>> available.
>> [0]PETSC ERROR: instead the line number of the start of the function is 
>> given.
>> [0]PETSC ERROR: #1 LandauKokkosJacobian() at 
>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272
>> 
>> On Sat, May 29, 2021 at 8:46 PM Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> 
>> On Sat, May 29, 2021 at 7:48 PM Barry Smith <bsm...@petsc.dev> wrote:
>> 
>>    I don't see why it is not running the Kokkos check. Here is the rule 
>> right below the CUDA rule that is apparently running.
>> 
>> check_build:
>>         -@echo "Running check examples to verify correct installation"
>>         -@echo "Using PETSC_DIR=${PETSC_DIR} and PETSC_ARCH=${PETSC_ARCH}"
>>         +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} 
>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} clean-legacy
>>         +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} 
>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} testex19
>>         +@if [ "${HYPRE_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] &&  
>> [ "${PETSC_SCALAR}" = "real" ]; then \
>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} 
>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} 
>> DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_hypre; \
>>          fi;
>>         +@if [ "${CUDA_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] &&  [ 
>> "${PETSC_SCALAR}" = "real" ]; then \
>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} 
>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} 
>> DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_cuda; \
>>          fi;
>>         +@if [ "${KOKKOS_KERNELS_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = 
>> "" ] &&  [ "${PETSC_SCALAR}" = "real" ] && [ "${PETSC_PRECISION}" = "double" 
>> ] && [ "${MPI_IS_MPIUNI}" = "0" ]; then \
>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} 
>> PETSC_ARCH=${PETSC_ARCH}  PETSC_DIR=${PETSC_DIR} 
>> DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex3k_kokkos; \
>>          fi;
>> 
>>   Regarding the debugging: with just one MPI rank (or even more), GDB will 
>> trap the error and show the exact line of source code where the error 
>> occurred, and you can poke around at variables to see if they look corrupt 
>> or wrong (for example, a crazy address in a pointer). I don't know why your 
>> debugger is not giving more useful information. 
>> 
>> 
>> This is what I did (in DDT). It stopped at the function call and the data 
>> looked fine. I stepped into the call, but didn't get to it. The signal 
>> handler was called and I was dead.
>> Maybe I did something in my branch. I can't see what, but I'll keep probing.
>> Thanks,
>>  
>>   Barry
>> 
>> 
>> > On May 29, 2021, at 2:16 PM, Mark Adams <mfad...@lbl.gov> wrote:
>> > 
>> > I am running on Summit with Kokkos-CUDA and I am getting a segv that looks 
>> > like some sort of a compile/link mismatch. I also have a user with a C++ 
>> > code that is getting strange segvs when calling MatSetValues with CUDA (I 
>> > know MatSetValues is not a cuSPARSE method, but that is the report that I 
>> > have). I have no idea if these are related but they both involve C -- C++ 
>> > calls ...
>> > 
>> > I started with a clean build (attached) and I ran in DDT. DDT stopped at 
>> > the call in plexland.c to the Kokkos Landau operator. I stepped into this 
>> > function and then took this screenshot of the stack, with the Kokkos call 
>> > and PETSc signal handler.
>> > 
>> > Make check does not seem to be running Kokkos tests:
>> > 
>> > 15:02 adams/landau-mass-opt *= /gpfs/alpine/csc314/scratch/adams/petsc$ 
>> > make PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc 
>> > PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 check
>> > Running check examples to verify correct installation
>> > Using PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc and 
>> > PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10
>> > C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>> > C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>> > C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>> > Completed test examples
>> > 
>> > Also, I ran this AM with another branch that had not been rebased with 
>> > main as recently as this branch (adams/landau-mass-opt).
>> > 
>> > Any ideas?
>> > <make.log><configure.log><Screen Shot 2021-05-29 at 2.51.00 PM.png>
>> 
> 
