Attached are the logs with 1 rank and 2 ranks. As far as I can tell
these are different errors.
For the log attached to the previous email I chose to run ex7 without
mpirun so that valgrind checks ex7 and not mpirun. Is there a way to
have valgrind check the mpi processes rather than mpirun?
Cheers,
Dominic
On 01/22/2014 10:37 AM, Paul Mullowney wrote:
Hmmm. I may not have protected against the case where the
mpaijcusp(arse) classes are called but without mpirun/mpiexec. I
suppose it should have occurred to me that someone would do this.
try :
mpirun -n 1 ./ex7 -mat_type mpiaijcusparse -vec_type cusp
In this scenario, the sequential to sequential vecscatters should be
called.
Then,
mpirun -n 2 ../ex7 -mat_type mpiaijcusparse -vec_type cusp
In this scenario, MPI_General vecscatters should be called ... and
work correctly if you have a system with multiple GPUs.
I
-Paul
On Wed, Jan 22, 2014 at 10:32 AM, Dominic Meiser <[email protected]
<mailto:[email protected]>> wrote:
Hey Paul,
Thanks for providing background on this.
On Wed 22 Jan 2014 10:05:13 AM MST, Paul Mullowney wrote:
Dominic,
A few years ago, I was trying to minimize the amount of data
transfer
to and from the GPU (for multi-GPU MatMult) by inspecting the
indices
of the data that needed to be message to and from the device.
Then, I
would call gather kernels on the GPU which pulled the
scattered data
into contiguous buffers and then be transferred to the host
asynchronously (while the MatMult was occurring). The existence of
VecScatterInitializeForGPU was added in order to build the
necessary
buffers as needed. This was the motivation behind the existence of
VecScatterInitializeForGPU.
An alternative approach is to message the smallest contiguous
buffer
containing all the data with a single cudaMemcpyAsync. This is the
method currently implemented.
I never found a case where the former implementation (with a GPU
gather-kernel) performed better than the alternative approach
which
messaged the smallest contiguous buffer. I looked at many,
many matrices.
Now, as far as I understand the VecScatter kernels, this
method should
only get called if the transfer is MPI_General (i.e. PtoP
parallel to
parallel). Other VecScatter methods are called in other
circumstances
where the the scatter is not MPI_General. That assumption could be
wrong though.
I see. I figured there was some logic in place to make sure that
this function only gets called in cases where the transfer type is
MPI_General. I'm getting segfaults in this function where the
todata and fromdata are of a different type. This could easily be
user error but I'm not sure. Here is an example valgrind error:
==27781== Invalid read of size 8
==27781== at 0x1188080: VecScatterInitializeForGPU (vscatcusp.c:46)
==27781== by 0xEEAE5D: MatMult_MPIAIJCUSPARSE(_p_Mat*, _p_Vec*,
_p_Vec*) (mpiaijcusparse.cu:108 <http://mpiaijcusparse.cu:108>)
==27781== by 0xA20CC3: MatMult (matrix.c:2242)
==27781== by 0x4645E4: main (ex7.c:93)
==27781== Address 0x286305e0 is 1,616 bytes inside a block of size
1,620 alloc'd
==27781== at 0x4C26548: memalign (vg_replace_malloc.c:727)
==27781== by 0x4654F9: PetscMallocAlign(unsigned long, int, char
const*, char const*, void**) (mal.c:27)
==27781== by 0xCAEECC: PetscTrMallocDefault(unsigned long, int,
char const*, char const*, void**) (mtr.c:186)
==27781== by 0x5A5296: VecScatterCreate (vscat.c:1168)
==27781== by 0x9AF3C5: MatSetUpMultiply_MPIAIJ (mmaij.c:116)
==27781== by 0x96F0F0: MatAssemblyEnd_MPIAIJ(_p_Mat*,
MatAssemblyType) (mpiaij.c:706)
==27781== by 0xA45358: MatAssemblyEnd (matrix.c:4959)
==27781== by 0x464301: main (ex7.c:78)
This was produced by src/ksp/ksp/tutorials/ex7.c. The command line
options are
./ex7 -mat_type mpiaijcusparse -vec_type cusp
In this particular case the todata is of type
VecScatter_Seq_Stride and fromdata is of type
VecScatter_Seq_General. The complete valgrind log (including
configure options for petsc) is attached.
Any comments or suggestions are appreciated.
Cheers,
Dominic
-Paul
On Wed, Jan 22, 2014 at 9:49 AM, Dominic Meiser
<[email protected] <mailto:[email protected]>
<mailto:[email protected] <mailto:[email protected]>>> wrote:
Hi,
I'm trying to understand VecScatterInitializeForGPU in
src/vec/vec/utils/veccusp/__vscatcusp.c. I don't understand why
this function can get away with casting the fromdata and todata in
the inctx to VecScatter_MPI_General. Don't we need to inspect the
VecScatterType fields of the todata and fromdata?
Cheers,
Dominic
--
Dominic Meiser
Tech-X Corporation
5621 Arapahoe Avenue
Boulder, CO 80303
USA
Telephone: 303-996-2036 <tel:303-996-2036> <tel:303-996-2036
<tel:303-996-2036>>
Fax: 303-448-7756 <tel:303-448-7756> <tel:303-448-7756
<tel:303-448-7756>>
www.txcorp.com <http://www.txcorp.com> <http://www.txcorp.com>
--
Dominic Meiser
Tech-X Corporation
5621 Arapahoe Avenue
Boulder, CO 80303
USA
Telephone: 303-996-2036 <tel:303-996-2036>
Fax: 303-448-7756 <tel:303-448-7756>
www.txcorp.com <http://www.txcorp.com>
--
Dominic Meiser
Tech-X Corporation
5621 Arapahoe Avenue
Boulder, CO 80303
USA
Telephone: 303-996-2036
Fax: 303-448-7756
www.txcorp.com
[0]PETSC ERROR: --------------------- Error Message
------------------------------------
[0]PETSC ERROR: Null argument, when expecting valid pointer!
[0]PETSC ERROR: Trying to zero at a null pointer!
[0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: Petsc Development GIT revision: v3.4.3-2332-g54f71ec GIT Date:
2014-01-20 14:12:11 -0700
[0]PETSC ERROR: See docs/changes/index.html for recent updates.
[0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[0]PETSC ERROR: See docs/index.html for manual pages.
[0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: ./ex7 on a pargpudbg named ivy.txcorp.com by dmeiser Wed Jan 22
10:47:22 2014
[0]PETSC ERROR: Libraries linked from
/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg/lib
[0]PETSC ERROR: Configure run at Tue Jan 21 16:53:42 2014
[0]PETSC ERROR: Configure options
--with-cmake=/scr_ivy/dmeiser/PTSOLVE/cmake/bin/cmake
--prefix=/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg --with-precision=double
--with-scalar-type=real --with-fortran-kernels=1 --with-x=no --with-mpi=yes
--with-mpi-dir=/scr_ivy/dmeiser/PTSOLVE/openmpi/ --with-openmp=yes
--with-valgrind=1 --with-shared-libraries=0 --with-c-support=yes
--with-debugging=yes --with-cuda=1 --with-cuda-dir=/usr/local/cuda
--with-cuda-arch=sm_35 --download-txpetscgpu --with-thrust=yes
--with-thrust-dir=/usr/local/cuda/include --with-umfpack=yes --download-umfpack
--with-mumps=yes --with-superlu=yes --download-superlu=yes --download-mumps=yes
--download-scalapack --download-parmetis --download-metis --with-cusp=yes
--with-cusp-dir=/scr_ivy/dmeiser/PTSOLVE/cusp/include --CUDAFLAGS="-O3
-I/usr/local/cuda/include --generate-code arch=compute_20,code=sm_20
--generate-code arch=compute_20,code=sm_21 --generate-code
arch=compute_30,code=sm_30 --generate-code arch=compute_35,code=sm_35"
--with-clanguage=C++ --CFLAGS="-pipe -fPIC" --CXXFLAGS="-pipe -fPIC"
--with-c2html=0 --with-gelus=1 --with-gelus-dir=/scr_ivy/dmeiser/software/gelus
[0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: PetscMemzero() line 1930 in
/scr_ivy/dmeiser/petsc/include/petscsys.h
[0]PETSC ERROR: VecSet_Seq() line 729 in
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/dvec2.c
[0]PETSC ERROR: VecSet() line 575 in
/scr_ivy/dmeiser/petsc/src/vec/vec/interface/rvector.c
[0]PETSC ERROR: KSPSolve() line 417 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: PCApply_BJacobi_Multiblock() line 945 in
/scr_ivy/dmeiser/petsc/src/ksp/pc/impls/bjacobi/bjacobi.c
[0]PETSC ERROR: PCApply() line 440 in
/scr_ivy/dmeiser/petsc/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: KSP_PCApply() line 227 in
/scr_ivy/dmeiser/petsc/include/petsc-private/kspimpl.h
[0]PETSC ERROR: KSPInitialResidual() line 64 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itres.c
[0]PETSC ERROR: KSPSolve_GMRES() line 234 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/impls/gmres/gmres.c
[1]PETSC ERROR: --------------------- Error Message
------------------------------------
[1]PETSC ERROR: Null argument, when expecting valid pointer!
[1]PETSC ERROR: Trying to zero at a null pointer!
[1]PETSC ERROR:
------------------------------------------------------------------------
[1]PETSC ERROR: Petsc Development GIT revision: v3.4.3-2332-g54f71ec GIT Date:
2014-01-20 14:12:11 -0700
[1]PETSC ERROR: See docs/changes/index.html for recent updates.
[1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[1]PETSC ERROR: See docs/index.html for manual pages.
[1]PETSC ERROR:
------------------------------------------------------------------------
[1]PETSC ERROR: ./ex7 on a pargpudbg named ivy.txcorp.com by dmeiser Wed Jan 22
10:47:22 2014
[1]PETSC ERROR: Libraries linked from
/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg/lib
[1]PETSC ERROR: Configure run at Tue Jan 21 16:53:42 2014
[1]PETSC ERROR: Configure options
--with-cmake=/scr_ivy/dmeiser/PTSOLVE/cmake/bin/cmake
--prefix=/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg --with-precision=double
--with-scalar-type=real --with-fortran-kernels=1 --with-x=no --with-mpi=yes
--with-mpi-dir=/scr_ivy/dmeiser/PTSOLVE/openmpi/ --with-openmp=yes
--with-valgrind=1 --with-shared-libraries=0 --with-c-support=yes
--with-debugging=yes --with-cuda=1 --with-cuda-dir=/usr/local/cuda
--with-cuda-arch=sm_35 --download-txpetscgpu --with-thrust=yes
--with-thrust-dir=/usr/local/cuda/include --with-umfpack=yes --download-umfpack
--with-mumps=yes --with-superlu=yes --download-superlu=yes --download-mumps=yes
--download-scalapack --download-parmetis --download-metis --with-cusp=yes
--with-cusp-dir=/scr_ivy/dmeiser/PTSOLVE/cusp/include --CUDAFLAGS="-O3
-I/usr/local/cuda/include --generate-code arch=compute_20,code=sm_20
--generate-code arch=compute_20,code=sm_21 --generate-code
arch=compute_30,code=sm_30 --generate-code arch=compute_35,code=sm_35"
--with-clanguage=C++ --CFLAGS="-pipe -fPIC" --CXXFLAGS="-pipe -fPIC"
--with-c2html=0 --with-gelus=1 --with-gelus-dir=/scr_ivy/dmeiser/software/gelus
[1]PETSC ERROR:
------------------------------------------------------------------------
[1]PETSC ERROR: PetscMemzero() line 1930 in
/scr_ivy/dmeiser/petsc/include/petscsys.h
[1]PETSC ERROR: VecSet_Seq() line 729 in
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/dvec2.c
[1]PETSC ERROR: VecSet() line 575 in
/scr_ivy/dmeiser/petsc/src/vec/vec/interface/rvector.c
[1]PETSC ERROR: KSPSolve() line 417 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[1]PETSC ERROR: PCApply_BJacobi_Multiblock() line 945 in
/scr_ivy/dmeiser/petsc/src/ksp/pc/impls/bjacobi/bjacobi.c
[1]PETSC ERROR: PCApply() line 440 in
/scr_ivy/dmeiser/petsc/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: KSPSolve() line 432 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: main() line 209 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/examples/tutorials/ex7.c
[1]PETSC ERROR: KSP_PCApply() line 227 in
/scr_ivy/dmeiser/petsc/include/petsc-private/kspimpl.h
[1]PETSC ERROR: KSPInitialResidual() line 64 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itres.c
[1]PETSC ERROR: KSPSolve_GMRES() line 234 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/impls/gmres/gmres.c
[1]PETSC ERROR: KSPSolve() line 432 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[1]PETSC ERROR: main() line 209 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/examples/tutorials/ex7.c
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 85.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 27839 on
node ivy.txcorp.com exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[ivy.txcorp.com:27838] 1 more process has sent help message help-mpi-api.txt /
mpi-abort
[ivy.txcorp.com:27838] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages
[0]PETSC ERROR: --------------------- Error Message
------------------------------------
[0]PETSC ERROR: Error in external library!
[0]PETSC ERROR: CUSP error 61!
[0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: Petsc Development GIT revision: v3.4.3-2332-g54f71ec GIT Date:
2014-01-20 14:12:11 -0700
[0]PETSC ERROR: See docs/changes/index.html for recent updates.
[0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[0]PETSC ERROR: See docs/index.html for manual pages.
[0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: ./ex7 on a pargpudbg named ivy.txcorp.com by dmeiser Wed Jan 22
10:47:15 2014
[0]PETSC ERROR: Libraries linked from
/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg/lib
[0]PETSC ERROR: Configure run at Tue Jan 21 16:53:42 2014
[0]PETSC ERROR: Configure options
--with-cmake=/scr_ivy/dmeiser/PTSOLVE/cmake/bin/cmake
--prefix=/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg --with-precision=double
--with-scalar-type=real --with-fortran-kernels=1 --with-x=no --with-mpi=yes
--with-mpi-dir=/scr_ivy/dmeiser/PTSOLVE/openmpi/ --with-openmp=yes
--with-valgrind=1 --with-shared-libraries=0 --with-c-support=yes
--with-debugging=yes --with-cuda=1 --with-cuda-dir=/usr/local/cuda
--with-cuda-arch=sm_35 --download-txpetscgpu --with-thrust=yes
--with-thrust-dir=/usr/local/cuda/include --with-umfpack=yes --download-umfpack
--with-mumps=yes --with-superlu=yes --download-superlu=yes --download-mumps=yes
--download-scalapack --download-parmetis --download-metis --with-cusp=yes
--with-cusp-dir=/scr_ivy/dmeiser/PTSOLVE/cusp/include --CUDAFLAGS="-O3
-I/usr/local/cuda/include --generate-code arch=compute_20,code=sm_20
--generate-code arch=compute_20,code=sm_21 --generate-code
arch=compute_30,code=sm_30 --generate-code arch=compute_35,code=sm_35"
--with-clanguage=C++ --CFLAGS="-pipe -fPIC" --CXXFLAGS="-pipe -fPIC"
--with-c2html=0 --with-gelus=1 --with-gelus-dir=/scr_ivy/dmeiser/software/gelus
[0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: VecCUSPAllocateCheck() line 72 in
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/seqcusp/veccusp.cu
[0]PETSC ERROR: VecCUSPCopyToGPU() line 96 in
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/seqcusp/veccusp.cu
[0]PETSC ERROR: VecCUSPGetArrayReadWrite() line 1946 in
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/seqcusp/veccusp.cu
[0]PETSC ERROR: VecAXPBYPCZ_SeqCUSP() line 1507 in
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/seqcusp/veccusp.cu
[0]PETSC ERROR: VecAXPBYPCZ() line 726 in
/scr_ivy/dmeiser/petsc/src/vec/vec/interface/rvector.c
[0]PETSC ERROR: KSPSolve_BCGS() line 120 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/impls/bcgs/bcgs.c
[0]PETSC ERROR: KSPSolve() line 432 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: PCApply_BJacobi_Multiblock() line 945 in
/scr_ivy/dmeiser/petsc/src/ksp/pc/impls/bjacobi/bjacobi.c
[0]PETSC ERROR: PCApply() line 440 in
/scr_ivy/dmeiser/petsc/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: KSP_PCApply() line 227 in
/scr_ivy/dmeiser/petsc/include/petsc-private/kspimpl.h
[0]PETSC ERROR: KSPInitialResidual() line 64 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itres.c
[0]PETSC ERROR: KSPSolve_GMRES() line 234 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/impls/gmres/gmres.c
[0]PETSC ERROR: KSPSolve() line 432 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: main() line 209 in
/scr_ivy/dmeiser/petsc/src/ksp/ksp/examples/tutorials/ex7.c
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 76.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 27836 on
node ivy.txcorp.com exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------