Hi Victor,

Thanks a lot! What should we do to get the new version?
Regards,
Alexander

On 10.05.2011 02:02, Victor Minden wrote:
> I believe I've resolved this issue.
>
> Cheers,
>
> Victor
> ---
> Victor L. Minden
>
> Tufts University
> School of Engineering
> Class of 2012
>
>
> On Sun, May 8, 2011 at 5:26 PM, Victor Minden <victorminden at gmail.com> wrote:
>
> Barry,
>
> I can verify this on breadboard now, with two processes.
>
> With cuda:
>
> minden at bb45:~/petsc-dev/src/snes/examples/tutorials$
> /home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile
> /home/balay/machinefile -n 2 ./ex47cu -da_grid_x 65535 -log_summary
> -snes_monitor -ksp_monitor -da_vec_type cusp
> 0 SNES Function norm 3.906279802209e-03
>   0 KSP Residual norm 5.994156809227e+00
>   1 KSP Residual norm 5.927247846249e-05
> 1 SNES Function norm 3.906225077938e-03
>   0 KSP Residual norm 5.993813868985e+00
>   1 KSP Residual norm 5.927575078206e-05
> terminate called after throwing an instance of 'thrust::system::system_error'
>   what(): invalid device pointer
> terminate called after throwing an instance of 'thrust::system::system_error'
>   what(): invalid device pointer
> Aborted (signal 6)
>
> Without cuda:
>
> minden at bb45:~/petsc-dev/src/snes/examples/tutorials$
> /home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile
> /home/balay/machinefile -n 2 ./ex47cu -da_grid_x 65535 -log_summary
> -snes_monitor -ksp_monitor
> 0 SNES Function norm 3.906279802209e-03
>   0 KSP Residual norm 5.994156809227e+00
>   1 KSP Residual norm 3.538158441448e-04
>   2 KSP Residual norm 3.124431921666e-04
>   3 KSP Residual norm 4.109213410989e-06
> 1 SNES Function norm 7.201017610490e-04
>   0 KSP Residual norm 3.317803708316e-02
>   1 KSP Residual norm 2.447380361169e-06
>   2 KSP Residual norm 2.164193969957e-06
>   3 KSP Residual norm 2.124317398679e-08
> 2 SNES Function norm 1.719678934825e-05
>   0 KSP Residual norm 1.651586453143e-06
>   1 KSP Residual norm 2.037037536868e-08
>   2 KSP Residual norm 1.109736798274e-08
>   3 KSP Residual norm 1.857218772156e-12
> 3 SNES Function norm 1.159391068583e-09
>   0 KSP Residual norm 3.116936044619e-11
>   1 KSP Residual norm 1.366503312678e-12
>   2 KSP Residual norm 6.598477672192e-13
>   3 KSP Residual norm 5.306147277879e-17
> 4 SNES Function norm 2.202297235559e-10
>
> Note the repeated norms when using cuda. Looks like I'll have to take
> a closer look at this.
>
> -Victor
>
> ---
> Victor L. Minden
>
> Tufts University
> School of Engineering
> Class of 2012
>
>
> On Thu, May 5, 2011 at 2:57 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >   Alexander
> >
> >   Thank you for the sample code; it will be very useful.
> >
> >   We have run parallel jobs with CUDA where each node has only
> > a single MPI process and uses a single GPU, without the crash that
> > you get below. I cannot explain why it would not work in your
> > situation. Do you have access to two nodes, each with a GPU, so you
> > could try that?
> >
> >   It is crashing in a delete of a
> >
> >   struct _p_PetscCUSPIndices {
> >     CUSPINTARRAYCPU indicesCPU;
> >     CUSPINTARRAYGPU indicesGPU;
> >   };
> >
> >   where CUSPINTARRAYGPU is a cusp::array1d<PetscInt,cusp::device_memory>;
> >
> >   thus it is crashing after it has completed actually doing the
> > computation. If you run with -snes_monitor -ksp_monitor, with and
> > without -da_vec_type cusp, on 2 processes, what do you get for
> > output in the two cases? I want to see if it is running correctly
> > on two processes.
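[For readers following along: a minimal sketch of what those two struct members amount to, assuming CUSPINTARRAYCPU and CUSPINTARRAYGPU are aliases for CUSP host/device index arrays (an assumption; the exact definitions live in the petsc-dev CUSP headers and are not quoted in this thread). The device-side member's destructor releases GPU memory through thrust, which is consistent with the cusp::array1d destructor and thrust device-free frames in the call stack later in the thread.]

// Hypothetical sketch only, not the petsc-dev source.
#include <cusp/array1d.h>
#include <petscsys.h>

typedef cusp::array1d<PetscInt, cusp::host_memory>   CUSPINTARRAYCPU;
typedef cusp::array1d<PetscInt, cusp::device_memory> CUSPINTARRAYGPU;

struct _p_PetscCUSPIndices {
  CUSPINTARRAYCPU indicesCPU; // host-side copy of the scatter indices
  CUSPINTARRAYGPU indicesGPU; // device-side copy; its destructor ends up in
                              // thrust::device_free, where the "invalid
                              // device pointer" abort is reported
};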
> >
> >   Could the crash be due to memory corruption sometime during the
> > computation?
> >
> >
> >   Barry
> >
> >
> > On May 5, 2011, at 3:38 AM, Alexander Grayver wrote:
> >
> >> Hello!
> >>
> >> We work with the petsc-dev branch and the ex47cu.cu example. Our platform is
> >> an Intel Quad processor and 8 identical Tesla GPUs. The CUDA 3.2 toolkit is
> >> installed.
> >> Ideally, we would like to make PETSc work in a multi-GPU way within
> >> just one node, so that different GPUs can be attached to different
> >> processes.
> >> Since this is not possible within the current PETSc implementation, we created a
> >> preload library (see LD_PRELOAD for details) for the CUBLAS function
> >> cublasInit().
> >> When PETSc calls this function, our library gets control and we assign
> >> GPUs according to the rank within the MPI communicator, then we call the original
> >> cublasInit().
> >> This preload library is very simple; see petsc_mgpu.c attached.
> >> This trick makes each process have its own context, and ideally all
> >> computations should be distributed over several GPUs.
> >>
> >> We managed to build petsc and the example (see makefile attached) and we
> >> tested it as follows:
> >>
> >> [agraiver at tesla-cmc new]$ ./lapexp -da_grid_x 65535 -info > cpu_1process.out
> >> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -info > cpu_2processes.out
> >> [agraiver at tesla-cmc new]$ ./lapexp -da_grid_x 65535 -da_vec_type cusp -info > gpu_1process.out
> >> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -da_vec_type cusp -info > gpu_2processes.out
> >>
> >> Everything except the last configuration works well. The last one crashes
> >> with the following exception and call stack:
> >> terminate called after throwing an instance of 'thrust::system::system_error'
> >>   what(): invalid device pointer
> >> [tesla-cmc:15549] *** Process received signal ***
> >> [tesla-cmc:15549] Signal: Aborted (6)
> >> [tesla-cmc:15549] Signal code: (-6)
> >> [tesla-cmc:15549] [ 0] /lib64/libpthread.so.0() [0x3de540eeb0]
> >> [tesla-cmc:15549] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3de50330c5]
> >> [tesla-cmc:15549] [ 2] /lib64/libc.so.6(abort+0x186) [0x3de5034a76]
> >> [tesla-cmc:15549] [ 3] /opt/llvm/dragonegg/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d) [0x7f0d3530b95d]
> >> [tesla-cmc:15549] [ 4] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7b76) [0x7f0d35309b76]
> >> [tesla-cmc:15549] [ 5] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7ba3) [0x7f0d35309ba3]
> >> [tesla-cmc:15549] [ 6] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7cae) [0x7f0d35309cae]
> >> [tesla-cmc:15549] [ 7] ./lapexp(_ZN6thrust6detail6device4cuda4freeILj0EEEvNS_10device_ptrIvEE+0x69) [0x426320]
> >> [tesla-cmc:15549] [ 8] ./lapexp(_ZN6thrust6detail6device8dispatch4freeILj0EEEvNS_10device_ptrIvEENS0_21cuda_device_space_tagE+0x2b) [0x4258b2]
> >> [tesla-cmc:15549] [ 9] ./lapexp(_ZN6thrust11device_freeENS_10device_ptrIvEE+0x2f) [0x424f78]
> >> [tesla-cmc:15549] [10] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust23device_malloc_allocatorIiE10deallocateENS_10device_ptrIiEEm+0x33) [0x7f0d36aeacff]
> >> [tesla-cmc:15549] [11] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEE10deallocateEv+0x6e) [0x7f0d36ae8e78]
> >> [tesla-cmc:15549] [12]
> >> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEED1Ev+0x19) [0x7f0d36ae75f7]
> >> [tesla-cmc:15549] [13] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail11vector_baseIiNS_23device_malloc_allocatorIiEEED1Ev+0x52) [0x7f0d36ae65f4]
> >> [tesla-cmc:15549] [14] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN4cusp7array1dIiN6thrust6detail21cuda_device_space_tagEED1Ev+0x18) [0x7f0d36ae5c2e]
> >> [tesla-cmc:15549] [15] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN19_p_PetscCUSPIndicesD1Ev+0x1d) [0x7f0d3751e45f]
> >> [tesla-cmc:15549] [16] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(PetscCUSPIndicesDestroy+0x20f) [0x7f0d3750c840]
> >> [tesla-cmc:15549] [17] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy_PtoP+0x1bc8) [0x7f0d375af8af]
> >> [tesla-cmc:15549] [18] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy+0x586) [0x7f0d375e9ddf]
> >> [tesla-cmc:15549] [19] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy_MPIAIJ+0x49f) [0x7f0d37191d24]
> >> [tesla-cmc:15549] [20] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy+0x546) [0x7f0d370d54fe]
> >> [tesla-cmc:15549] [21] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESReset+0x5d1) [0x7f0d3746fac3]
> >> [tesla-cmc:15549] [22] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESDestroy+0x4b8) [0x7f0d37470210]
> >> [tesla-cmc:15549] [23] ./lapexp(main+0x5ed) [0x420745]
> >>
> >> I've sent all the detailed output files for the execution
> >> configurations listed above, as well as configure.log and make.log, to
> >> petsc-maint at mcs.anl.gov, hoping that someone could recognize the problem.
> >> Now we have one node with multiple GPUs, but I'm also wondering whether someone
> >> has really tested the GPU functionality over several nodes with one GPU each?
> >>
> >> Regards,
> >> Alexander
> >>
> >> <petsc_mgpu.c><makefile.txt><configure.log>
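[The LD_PRELOAD trick Alexander describes above, intercepting cublasInit() so each MPI rank selects its own GPU before the CUBLAS context is created, might look roughly like the sketch below. This is a hypothetical reconstruction, not the petsc_mgpu.c attachment: it assumes the legacy CUBLAS v1 API from cublas.h, that MPI is already initialized when PETSc first calls cublasInit(), and that dlsym(RTLD_NEXT, ...) can reach the real symbol in libcublas.]

// Hypothetical petsc_mgpu-style shim (sketch only, not the actual attachment).
// Build as a shared library and load it with LD_PRELOAD so this definition
// of cublasInit() shadows the one in libcublas.
#include <dlfcn.h>          // dlsym, RTLD_NEXT (needs _GNU_SOURCE; g++ defines it)
#include <mpi.h>
#include <cuda_runtime.h>   // cudaGetDeviceCount, cudaSetDevice
#include <cublas.h>         // legacy CUBLAS v1 API: cublasStatus, cublasInit

extern "C" cublasStatus cublasInit(void)
{
  // Pin this process to one of the visible GPUs based on its MPI rank,
  // before CUBLAS creates its context on the default device.
  int initialized = 0;
  MPI_Initialized(&initialized);
  if (initialized) {
    int rank = 0, ndev = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);
  }

  // Chain to the real cublasInit() that PETSc intended to call.
  typedef cublasStatus (*cublasInitFn)(void);
  cublasInitFn real_cublasInit = (cublasInitFn) dlsym(RTLD_NEXT, "cublasInit");
  return real_cublasInit();
}

[Built, for example, with an MPI C++ wrapper as a shared object (-shared -fPIC, linking -ldl and -lcudart) and exported via LD_PRELOAD before mpirun, each rank would then hold its own CUDA context on a different device, which is the configuration in which the VecScatterDestroy/PetscCUSPIndicesDestroy abort above was observed.]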
