Attached are the logs with 1 rank and 2 ranks. As far as I can tell these are different errors.

For the log attached to the previous email, I chose to run ex7 without mpirun so that valgrind checks ex7 and not mpirun. Is there a way to have valgrind check the MPI processes rather than mpirun?

Cheers,
Dominic


On 01/22/2014 10:37 AM, Paul Mullowney wrote:
Hmmm. I may not have protected against the case where the mpiaijcusp(arse) classes are called without mpirun/mpiexec. I suppose it should have occurred to me that someone would do this.
Try:
mpirun -n 1 ./ex7 -mat_type mpiaijcusparse -vec_type cusp
In this scenario, the sequential to sequential vecscatters should be called.
Then,
mpirun -n 2 ./ex7 -mat_type mpiaijcusparse -vec_type cusp
In this scenario, MPI_General vecscatters should be called ... and work correctly if you have a system with multiple GPUs.
-Paul


On Wed, Jan 22, 2014 at 10:32 AM, Dominic Meiser wrote:

    Hey Paul,

    Thanks for providing background on this.


    On Wed 22 Jan 2014 10:05:13 AM MST, Paul Mullowney wrote:


        Dominic,
        A few years ago, I was trying to minimize the amount of data
        transferred to and from the GPU (for multi-GPU MatMult) by
        inspecting the indices of the data that needed to be messaged to
        and from the device. Then, I would call gather kernels on the GPU
        which pulled the scattered data into contiguous buffers, which
        were then transferred to the host asynchronously (while the
        MatMult was occurring). VecScatterInitializeForGPU was added in
        order to build the necessary buffers as needed.
        An alternative approach is to message the smallest contiguous
        buffer containing all the data with a single cudaMemcpyAsync.
        This is the method currently implemented.
        I never found a case where the former implementation (with a GPU
        gather kernel) performed better than the alternative approach,
        which messaged the smallest contiguous buffer. I looked at many,
        many matrices.
        Now, as far as I understand the VecScatter kernels, this method
        should only get called if the transfer is MPI_General (i.e. PtoP,
        parallel to parallel). Other VecScatter methods are called in
        other circumstances where the scatter is not MPI_General. That
        assumption could be wrong though.
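
A minimal host-side sketch of the two transfer strategies described above, for illustration only. It assumes plain C arrays in place of device memory: the loop in pack_gather stands in for the GPU gather kernel, and the memcpy in pack_span stands in for the single cudaMemcpyAsync. The function names pack_gather and pack_span are hypothetical and do not appear in PETSc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Strategy A (hypothetical name): gather only the needed entries into a
   packed buffer. On the GPU this would be a gather kernel; here it is a loop.
   Returns the number of entries that would be transferred. */
static size_t pack_gather(const double *x, const int *idx, size_t n, double *buf)
{
  for (size_t i = 0; i < n; i++) buf[i] = x[idx[i]];
  return n;
}

/* Strategy B (hypothetical name): copy the smallest contiguous range of x
   that covers every index. On the GPU this would be one cudaMemcpyAsync;
   memcpy stands in for it here. Stores the range start in *lo_out and
   returns the number of entries that would be transferred. */
static size_t pack_span(const double *x, const int *idx, size_t n,
                        double *buf, int *lo_out)
{
  assert(n > 0);
  int lo = idx[0], hi = idx[0];
  for (size_t i = 1; i < n; i++) {
    if (idx[i] < lo) lo = idx[i];
    if (idx[i] > hi) hi = idx[i];
  }
  size_t len = (size_t)(hi - lo) + 1;
  memcpy(buf, x + lo, len * sizeof(double));
  *lo_out = lo;
  return len;
}
```

With the span strategy the receiver finds entry idx[i] at buf[idx[i] - lo]. The span can move more data than the gather when the indices are sparse, but it needs no extra kernel launch, which is consistent with the observation above that the gather variant never came out ahead.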



    I see. I figured there was some logic in place to make sure that
    this function only gets called in cases where the transfer type is
    MPI_General. I'm getting segfaults in this function where the
    todata and fromdata are of a different type. This could easily be
    user error but I'm not sure. Here is an example valgrind error:

    ==27781== Invalid read of size 8
    ==27781== at 0x1188080: VecScatterInitializeForGPU (vscatcusp.c:46)
    ==27781== by 0xEEAE5D: MatMult_MPIAIJCUSPARSE(_p_Mat*, _p_Vec*,
    _p_Vec*) (mpiaijcusparse.cu:108)
    ==27781== by 0xA20CC3: MatMult (matrix.c:2242)
    ==27781== by 0x4645E4: main (ex7.c:93)
    ==27781== Address 0x286305e0 is 1,616 bytes inside a block of size
    1,620 alloc'd
    ==27781== at 0x4C26548: memalign (vg_replace_malloc.c:727)
    ==27781== by 0x4654F9: PetscMallocAlign(unsigned long, int, char
    const*, char const*, void**) (mal.c:27)
    ==27781== by 0xCAEECC: PetscTrMallocDefault(unsigned long, int,
    char const*, char const*, void**) (mtr.c:186)
    ==27781== by 0x5A5296: VecScatterCreate (vscat.c:1168)
    ==27781== by 0x9AF3C5: MatSetUpMultiply_MPIAIJ (mmaij.c:116)
    ==27781== by 0x96F0F0: MatAssemblyEnd_MPIAIJ(_p_Mat*,
    MatAssemblyType) (mpiaij.c:706)
    ==27781== by 0xA45358: MatAssemblyEnd (matrix.c:4959)
    ==27781== by 0x464301: main (ex7.c:78)

    This was produced by src/ksp/ksp/examples/tutorials/ex7.c. The
    command line options are

    ./ex7 -mat_type mpiaijcusparse -vec_type cusp

    In this particular case the todata is of type
    VecScatter_Seq_Stride and fromdata is of type
    VecScatter_Seq_General. The complete valgrind log (including
    configure options for petsc) is attached.

    Any comments or suggestions are appreciated.
    Cheers,
    Dominic


        -Paul


        On Wed, Jan 22, 2014 at 9:49 AM, Dominic Meiser wrote:

        Hi,

        I'm trying to understand VecScatterInitializeForGPU in
        src/vec/vec/utils/veccusp/vscatcusp.c. I don't understand why
        this function can get away with casting the fromdata and todata in
        the inctx to VecScatter_MPI_General. Don't we need to inspect the
        VecScatterType fields of the todata and fromdata?
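
A minimal sketch of the kind of check asked about here, assuming every scatter context begins with a type tag that can be inspected before the cast. The names ScatterKind, ScatterSeqStride, ScatterMPIGeneral, as_mpi_general and initialize_for_gpu are hypothetical stand-ins; PETSc's actual structs and fields differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tag; not PETSc's VecScatterType. */
typedef enum { SCATTER_SEQ_GENERAL, SCATTER_SEQ_STRIDE, SCATTER_MPI_GENERAL } ScatterKind;

/* Hypothetical contexts. Both start with the tag, so the tag can be read
   through a pointer to either struct. The MPI-general context is larger,
   which is why an unchecked cast from the smaller struct reads out of bounds. */
typedef struct { ScatterKind kind; int n; int first; int step; } ScatterSeqStride;
typedef struct { ScatterKind kind; int n; int *indices; int nmsgs; int *starts; } ScatterMPIGeneral;

/* Return ctx viewed as an MPI-general context, or NULL if it is another kind. */
static const ScatterMPIGeneral *as_mpi_general(const void *ctx)
{
  if (!ctx) return NULL;
  /* A pointer to a struct may be converted to a pointer to its first member. */
  if (*(const ScatterKind *)ctx != SCATTER_MPI_GENERAL) return NULL;
  return (const ScatterMPIGeneral *)ctx;
}

/* Return 1 if GPU index buffers would be prepared, 0 if the call is skipped. */
static int initialize_for_gpu(const void *todata, const void *fromdata)
{
  const ScatterMPIGeneral *to = as_mpi_general(todata);
  const ScatterMPIGeneral *from = as_mpi_general(fromdata);
  if (!to || !from) return 0; /* sequential scatter: nothing to prepare */
  assert(to->kind == SCATTER_MPI_GENERAL && from->kind == SCATTER_MPI_GENERAL);
  /* The MPI-only fields (indices, starts) are safe to read from here on. */
  return 1;
}
```

Under these assumptions, a sequential context passed as todata or fromdata makes the function return early instead of reading fields that exist only in the MPI-general layout.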

        Cheers,
        Dominic

        --
        Dominic Meiser
        Tech-X Corporation
        5621 Arapahoe Avenue
        Boulder, CO 80303
        USA
        Telephone: 303-996-2036
        Fax: 303-448-7756
        www.txcorp.com









--
Dominic Meiser
Tech-X Corporation
5621 Arapahoe Avenue
Boulder, CO 80303
USA
Telephone: 303-996-2036
Fax: 303-448-7756
www.txcorp.com

[0]PETSC ERROR: --------------------- Error Message 
------------------------------------
[0]PETSC ERROR: Null argument, when expecting valid pointer!
[0]PETSC ERROR: Trying to zero at a null pointer!
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: Petsc Development GIT revision: v3.4.3-2332-g54f71ec  GIT Date: 
2014-01-20 14:12:11 -0700
[0]PETSC ERROR: See docs/changes/index.html for recent updates.
[0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[0]PETSC ERROR: See docs/index.html for manual pages.
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: ./ex7 on a pargpudbg named ivy.txcorp.com by dmeiser Wed Jan 22 
10:47:22 2014
[0]PETSC ERROR: Libraries linked from 
/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg/lib
[0]PETSC ERROR: Configure run at Tue Jan 21 16:53:42 2014
[0]PETSC ERROR: Configure options 
--with-cmake=/scr_ivy/dmeiser/PTSOLVE/cmake/bin/cmake 
--prefix=/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg --with-precision=double 
--with-scalar-type=real --with-fortran-kernels=1 --with-x=no --with-mpi=yes 
--with-mpi-dir=/scr_ivy/dmeiser/PTSOLVE/openmpi/ --with-openmp=yes 
--with-valgrind=1 --with-shared-libraries=0 --with-c-support=yes 
--with-debugging=yes --with-cuda=1 --with-cuda-dir=/usr/local/cuda 
--with-cuda-arch=sm_35 --download-txpetscgpu --with-thrust=yes 
--with-thrust-dir=/usr/local/cuda/include --with-umfpack=yes --download-umfpack 
--with-mumps=yes --with-superlu=yes --download-superlu=yes --download-mumps=yes 
--download-scalapack --download-parmetis --download-metis --with-cusp=yes 
--with-cusp-dir=/scr_ivy/dmeiser/PTSOLVE/cusp/include --CUDAFLAGS="-O3 
-I/usr/local/cuda/include   --generate-code arch=compute_20,code=sm_20   
--generate-code arch=compute_20,code=sm_21   --generate-code 
arch=compute_30,code=sm_30   --generate-code arch=compute_35,code=sm_35" 
--with-clanguage=C++ --CFLAGS="-pipe -fPIC" --CXXFLAGS="-pipe -fPIC" 
--with-c2html=0 --with-gelus=1 --with-gelus-dir=/scr_ivy/dmeiser/software/gelus
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: PetscMemzero() line 1930 in 
/scr_ivy/dmeiser/petsc/include/petscsys.h
[0]PETSC ERROR: VecSet_Seq() line 729 in 
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/dvec2.c
[0]PETSC ERROR: VecSet() line 575 in 
/scr_ivy/dmeiser/petsc/src/vec/vec/interface/rvector.c
[0]PETSC ERROR: KSPSolve() line 417 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: PCApply_BJacobi_Multiblock() line 945 in 
/scr_ivy/dmeiser/petsc/src/ksp/pc/impls/bjacobi/bjacobi.c
[0]PETSC ERROR: PCApply() line 440 in 
/scr_ivy/dmeiser/petsc/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: KSP_PCApply() line 227 in 
/scr_ivy/dmeiser/petsc/include/petsc-private/kspimpl.h
[0]PETSC ERROR: KSPInitialResidual() line 64 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itres.c
[0]PETSC ERROR: KSPSolve_GMRES() line 234 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/impls/gmres/gmres.c
[1]PETSC ERROR: --------------------- Error Message 
------------------------------------
[1]PETSC ERROR: Null argument, when expecting valid pointer!
[1]PETSC ERROR: Trying to zero at a null pointer!
[1]PETSC ERROR: 
------------------------------------------------------------------------
[1]PETSC ERROR: Petsc Development GIT revision: v3.4.3-2332-g54f71ec  GIT Date: 
2014-01-20 14:12:11 -0700
[1]PETSC ERROR: See docs/changes/index.html for recent updates.
[1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[1]PETSC ERROR: See docs/index.html for manual pages.
[1]PETSC ERROR: 
------------------------------------------------------------------------
[1]PETSC ERROR: ./ex7 on a pargpudbg named ivy.txcorp.com by dmeiser Wed Jan 22 
10:47:22 2014
[1]PETSC ERROR: Libraries linked from 
/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg/lib
[1]PETSC ERROR: Configure run at Tue Jan 21 16:53:42 2014
[1]PETSC ERROR: Configure options 
--with-cmake=/scr_ivy/dmeiser/PTSOLVE/cmake/bin/cmake 
--prefix=/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg --with-precision=double 
--with-scalar-type=real --with-fortran-kernels=1 --with-x=no --with-mpi=yes 
--with-mpi-dir=/scr_ivy/dmeiser/PTSOLVE/openmpi/ --with-openmp=yes 
--with-valgrind=1 --with-shared-libraries=0 --with-c-support=yes 
--with-debugging=yes --with-cuda=1 --with-cuda-dir=/usr/local/cuda 
--with-cuda-arch=sm_35 --download-txpetscgpu --with-thrust=yes 
--with-thrust-dir=/usr/local/cuda/include --with-umfpack=yes --download-umfpack 
--with-mumps=yes --with-superlu=yes --download-superlu=yes --download-mumps=yes 
--download-scalapack --download-parmetis --download-metis --with-cusp=yes 
--with-cusp-dir=/scr_ivy/dmeiser/PTSOLVE/cusp/include --CUDAFLAGS="-O3 
-I/usr/local/cuda/include   --generate-code arch=compute_20,code=sm_20   
--generate-code arch=compute_20,code=sm_21   --generate-code 
arch=compute_30,code=sm_30   --generate-code arch=compute_35,code=sm_35" 
--with-clanguage=C++ --CFLAGS="-pipe -fPIC" --CXXFLAGS="-pipe -fPIC" 
--with-c2html=0 --with-gelus=1 --with-gelus-dir=/scr_ivy/dmeiser/software/gelus
[1]PETSC ERROR: 
------------------------------------------------------------------------
[1]PETSC ERROR: PetscMemzero() line 1930 in 
/scr_ivy/dmeiser/petsc/include/petscsys.h
[1]PETSC ERROR: VecSet_Seq() line 729 in 
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/dvec2.c
[1]PETSC ERROR: VecSet() line 575 in 
/scr_ivy/dmeiser/petsc/src/vec/vec/interface/rvector.c
[1]PETSC ERROR: KSPSolve() line 417 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[1]PETSC ERROR: PCApply_BJacobi_Multiblock() line 945 in 
/scr_ivy/dmeiser/petsc/src/ksp/pc/impls/bjacobi/bjacobi.c
[1]PETSC ERROR: PCApply() line 440 in 
/scr_ivy/dmeiser/petsc/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: KSPSolve() line 432 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: main() line 209 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/examples/tutorials/ex7.c
[1]PETSC ERROR: KSP_PCApply() line 227 in 
/scr_ivy/dmeiser/petsc/include/petsc-private/kspimpl.h
[1]PETSC ERROR: KSPInitialResidual() line 64 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itres.c
[1]PETSC ERROR: KSPSolve_GMRES() line 234 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/impls/gmres/gmres.c
[1]PETSC ERROR: KSPSolve() line 432 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[1]PETSC ERROR: main() line 209 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/examples/tutorials/ex7.c
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 85.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 27839 on
node ivy.txcorp.com exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[ivy.txcorp.com:27838] 1 more process has sent help message help-mpi-api.txt / 
mpi-abort
[ivy.txcorp.com:27838] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages
[0]PETSC ERROR: --------------------- Error Message 
------------------------------------
[0]PETSC ERROR: Error in external library!
[0]PETSC ERROR: CUSP error 61!
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: Petsc Development GIT revision: v3.4.3-2332-g54f71ec  GIT Date: 
2014-01-20 14:12:11 -0700
[0]PETSC ERROR: See docs/changes/index.html for recent updates.
[0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[0]PETSC ERROR: See docs/index.html for manual pages.
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: ./ex7 on a pargpudbg named ivy.txcorp.com by dmeiser Wed Jan 22 
10:47:15 2014
[0]PETSC ERROR: Libraries linked from 
/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg/lib
[0]PETSC ERROR: Configure run at Tue Jan 21 16:53:42 2014
[0]PETSC ERROR: Configure options 
--with-cmake=/scr_ivy/dmeiser/PTSOLVE/cmake/bin/cmake 
--prefix=/scr_ivy/dmeiser/petsc-gpu-dev/build/pargpudbg --with-precision=double 
--with-scalar-type=real --with-fortran-kernels=1 --with-x=no --with-mpi=yes 
--with-mpi-dir=/scr_ivy/dmeiser/PTSOLVE/openmpi/ --with-openmp=yes 
--with-valgrind=1 --with-shared-libraries=0 --with-c-support=yes 
--with-debugging=yes --with-cuda=1 --with-cuda-dir=/usr/local/cuda 
--with-cuda-arch=sm_35 --download-txpetscgpu --with-thrust=yes 
--with-thrust-dir=/usr/local/cuda/include --with-umfpack=yes --download-umfpack 
--with-mumps=yes --with-superlu=yes --download-superlu=yes --download-mumps=yes 
--download-scalapack --download-parmetis --download-metis --with-cusp=yes 
--with-cusp-dir=/scr_ivy/dmeiser/PTSOLVE/cusp/include --CUDAFLAGS="-O3 
-I/usr/local/cuda/include   --generate-code arch=compute_20,code=sm_20   
--generate-code arch=compute_20,code=sm_21   --generate-code 
arch=compute_30,code=sm_30   --generate-code arch=compute_35,code=sm_35" 
--with-clanguage=C++ --CFLAGS="-pipe -fPIC" --CXXFLAGS="-pipe -fPIC" 
--with-c2html=0 --with-gelus=1 --with-gelus-dir=/scr_ivy/dmeiser/software/gelus
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: VecCUSPAllocateCheck() line 72 in 
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/seqcusp/veccusp.cu
[0]PETSC ERROR: VecCUSPCopyToGPU() line 96 in 
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/seqcusp/veccusp.cu
[0]PETSC ERROR: VecCUSPGetArrayReadWrite() line 1946 in 
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/seqcusp/veccusp.cu
[0]PETSC ERROR: VecAXPBYPCZ_SeqCUSP() line 1507 in 
/scr_ivy/dmeiser/petsc/src/vec/vec/impls/seq/seqcusp/veccusp.cu
[0]PETSC ERROR: VecAXPBYPCZ() line 726 in 
/scr_ivy/dmeiser/petsc/src/vec/vec/interface/rvector.c
[0]PETSC ERROR: KSPSolve_BCGS() line 120 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/impls/bcgs/bcgs.c
[0]PETSC ERROR: KSPSolve() line 432 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: PCApply_BJacobi_Multiblock() line 945 in 
/scr_ivy/dmeiser/petsc/src/ksp/pc/impls/bjacobi/bjacobi.c
[0]PETSC ERROR: PCApply() line 440 in 
/scr_ivy/dmeiser/petsc/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: KSP_PCApply() line 227 in 
/scr_ivy/dmeiser/petsc/include/petsc-private/kspimpl.h
[0]PETSC ERROR: KSPInitialResidual() line 64 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itres.c
[0]PETSC ERROR: KSPSolve_GMRES() line 234 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/impls/gmres/gmres.c
[0]PETSC ERROR: KSPSolve() line 432 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: main() line 209 in 
/scr_ivy/dmeiser/petsc/src/ksp/ksp/examples/tutorials/ex7.c
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 76.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 27836 on
node ivy.txcorp.com exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
