On Wed 22 Jan 2014 10:54:28 AM MST, Paul Mullowney wrote:
Oh. You're opening a can of worms but maybe that's your intent ;) I see the block Jacobi preconditioner in the valgrind logs.
Didn't mean to open a can of worms.
Do:

mpirun -n 1 (or 2) ./ex7 -mat_type mpiaijcusparse -vec_type mpicusp -pc_type none
This works.
From here, we can try to sort out the VecScatterInitializeForGPU problem when mpirun/mpiexec is not used. If you want to implement a block Jacobi preconditioner on multiple GPUs, that's a larger problem to solve. I had some code that sort of worked. We'd have to sit down and discuss.
I'd be really interested in learning more about this. Cheers, Dominic
-Paul

On Wed, Jan 22, 2014 at 10:48 AM, Dominic Meiser <[email protected]> wrote:

Attached are the logs with 1 rank and 2 ranks. As far as I can tell these are different errors. For the log attached to the previous email I chose to run ex7 without mpirun so that valgrind checks ex7 and not mpirun. Is there a way to have valgrind check the MPI processes rather than mpirun?

Cheers,
Dominic

On 01/22/2014 10:37 AM, Paul Mullowney wrote:

Hmmm. I may not have protected against the case where the mpiaijcusp(arse) classes are called without mpirun/mpiexec. I suppose it should have occurred to me that someone would do this.

Try:

mpirun -n 1 ./ex7 -mat_type mpiaijcusparse -vec_type cusp

In this scenario, the sequential to sequential vecscatters should be called.

Then,

mpirun -n 2 ./ex7 -mat_type mpiaijcusparse -vec_type cusp

In this scenario, MPI_General vecscatters should be called ... and work correctly if you have a system with multiple GPUs. I

-Paul

On Wed, Jan 22, 2014 at 10:32 AM, Dominic Meiser <[email protected]> wrote:

Hey Paul,

Thanks for providing background on this.

On Wed 22 Jan 2014 10:05:13 AM MST, Paul Mullowney wrote:

Dominic,

A few years ago, I was trying to minimize the amount of data transfer to and from the GPU (for multi-GPU MatMult) by inspecting the indices of the data that needed to be messaged to and from the device. Then, I would call gather kernels on the GPU which pulled the scattered data into contiguous buffers, which were then transferred to the host asynchronously (while the MatMult was occurring). VecScatterInitializeForGPU was added to build the necessary buffers as needed; that was the motivation for its existence.

An alternative approach is to message the smallest contiguous buffer containing all the data with a single cudaMemcpyAsync. This is the method currently implemented.
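[Editorial sketch, not part of the original message.] The "smallest contiguous buffer" approach described above can be illustrated in plain C. This is only a sketch under stated assumptions: `covering_range` and `stage_contiguous` are hypothetical helper names, not PETSc functions, and `memcpy` stands in for the single `cudaMemcpyAsync` so that the example runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find the smallest contiguous index range [*lo, *hi] that covers all n
   indices. Returns 0 on success, -1 if there are no indices.
   Hypothetical helper, not a PETSc function. */
int covering_range(const int *idx, size_t n, int *lo, int *hi)
{
  if (n == 0) return -1;
  *lo = *hi = idx[0];
  for (size_t i = 1; i < n; ++i) {
    if (idx[i] < *lo) *lo = idx[i];
    if (idx[i] > *hi) *hi = idx[i];
  }
  return 0;
}

/* Copy the covering range of src into staging with ONE bulk copy.
   memcpy is a stand-in for a single cudaMemcpyAsync (assumption made so the
   sketch runs on the host). Returns the number of elements copied, or 0 if
   there is nothing to copy or staging is too small. On success *lo_out
   holds the first index of the copied range. */
size_t stage_contiguous(const double *src, const int *idx, size_t n,
                        double *staging, size_t capacity, int *lo_out)
{
  int lo, hi;
  if (covering_range(idx, n, &lo, &hi) != 0) return 0;
  assert(lo >= 0); /* indices are offsets into src */
  size_t count = (size_t)(hi - lo) + 1;
  if (count > capacity) return 0;
  memcpy(staging, src + lo, count * sizeof *src);
  *lo_out = lo;
  return count;
}
```

After the copy, the entry for index `idx[i]` sits at `staging[idx[i] - lo]`. The copy may move entries that are not needed, trading extra bytes for a single transfer and no gather kernel.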
I never found a case where the former implementation (with a GPU gather-kernel) performed better than the alternative approach which messaged the smallest contiguous buffer. I looked at many, many matrices.

Now, as far as I understand the VecScatter kernels, this method should only get called if the transfer is MPI_General (i.e. PtoP parallel to parallel). Other VecScatter methods are called in other circumstances where the scatter is not MPI_General. That assumption could be wrong though.

I see. I figured there was some logic in place to make sure that this function only gets called in cases where the transfer type is MPI_General. I'm getting segfaults in this function where the todata and fromdata are of a different type. This could easily be user error but I'm not sure. Here is an example valgrind error:

==27781== Invalid read of size 8
==27781==    at 0x1188080: VecScatterInitializeForGPU (vscatcusp.c:46)
==27781==    by 0xEEAE5D: MatMult_MPIAIJCUSPARSE(_p_Mat*, _p_Vec*, _p_Vec*) (mpiaijcusparse.cu:108)
==27781==    by 0xA20CC3: MatMult (matrix.c:2242)
==27781==    by 0x4645E4: main (ex7.c:93)
==27781==  Address 0x286305e0 is 1,616 bytes inside a block of size 1,620 alloc'd
==27781==    at 0x4C26548: memalign (vg_replace_malloc.c:727)
==27781==    by 0x4654F9: PetscMallocAlign(unsigned long, int, char const*, char const*, void**) (mal.c:27)
==27781==    by 0xCAEECC: PetscTrMallocDefault(unsigned long, int, char const*, char const*, void**) (mtr.c:186)
==27781==    by 0x5A5296: VecScatterCreate (vscat.c:1168)
==27781==    by 0x9AF3C5: MatSetUpMultiply_MPIAIJ (mmaij.c:116)
==27781==    by 0x96F0F0: MatAssemblyEnd_MPIAIJ(_p_Mat*, MatAssemblyType) (mpiaij.c:706)
==27781==    by 0xA45358: MatAssemblyEnd (matrix.c:4959)
==27781==    by 0x464301: main (ex7.c:78)

This was produced by src/ksp/ksp/tutorials/ex7.c.
The command line options are

./ex7 -mat_type mpiaijcusparse -vec_type cusp

In this particular case the todata is of type VecScatter_Seq_Stride and fromdata is of type VecScatter_Seq_General. The complete valgrind log (including configure options for petsc) is attached.

Any comments or suggestions are appreciated.

Cheers,
Dominic

-Paul

On Wed, Jan 22, 2014 at 9:49 AM, Dominic Meiser <[email protected]> wrote:

Hi,

I'm trying to understand VecScatterInitializeForGPU in src/vec/vec/utils/veccusp/vscatcusp.c. I don't understand why this function can get away with casting the fromdata and todata in the inctx to VecScatter_MPI_General. Don't we need to inspect the VecScatterType fields of the todata and fromdata?

Cheers,
Dominic

--
Dominic Meiser
Tech-X Corporation
5621 Arapahoe Avenue
Boulder, CO 80303
USA
Telephone: 303-996-2036
Fax: 303-448-7756
www.txcorp.com
-- Dominic Meiser Tech-X Corporation 5621 Arapahoe Avenue Boulder, CO 80303 USA Telephone: 303-996-2036 Fax: 303-448-7756 www.txcorp.com
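
[Editorial sketch, not part of the original thread.] The question raised in the thread, whether the type fields of todata and fromdata should be inspected before casting, can be illustrated with a small tagged-struct example in C. Everything here is hypothetical: the enum, the structs, and the functions `as_mpi_general` and `initialize_for_gpu` are simplified stand-ins, not PETSc's actual definitions. Casting a smaller variant to a larger one and reading its trailing fields is the kind of access that would show up as an invalid read near the end of an allocated block, as in the valgrind log.

```c
#include <stddef.h>

/* Hypothetical type tags, loosely modeled on the idea of a VecScatterType
   field. These are NOT PETSc's definitions. */
typedef enum {
  SCATTER_SEQ_GENERAL,
  SCATTER_SEQ_STRIDE,
  SCATTER_MPI_GENERAL
} ScatterType;

/* Sequential strided variant: small struct, tag first. */
typedef struct {
  ScatterType type;
  int n, first, step;
} ScatterSeqStride;

/* Parallel variant: larger struct with fields the others do not have. */
typedef struct {
  ScatterType type;
  int n;
  const int *indices;
  int nprocs;        /* only meaningful for the parallel variant */
  const int *starts; /* only meaningful for the parallel variant */
} ScatterMPIGeneral;

/* Return data viewed as the parallel variant, or NULL if its tag says it is
   something else. Reading the tag through a pointer to the first member is
   valid because every variant begins with the ScatterType field. */
const ScatterMPIGeneral *as_mpi_general(const void *data)
{
  if (data == NULL) return NULL;
  if (*(const ScatterType *)data != SCATTER_MPI_GENERAL) return NULL;
  return (const ScatterMPIGeneral *)data;
}

/* Hypothetical GPU setup routine: returns 1 if the GPU buffers would be
   built, 0 if the scatter is not parallel-to-parallel and setup is skipped. */
int initialize_for_gpu(const void *todata, const void *fromdata)
{
  const ScatterMPIGeneral *to = as_mpi_general(todata);
  const ScatterMPIGeneral *from = as_mpi_general(fromdata);
  if (to == NULL || from == NULL) return 0; /* skip instead of misreading */
  /* Only here is it safe to touch to->nprocs, to->starts, and so on. */
  return 1;
}
```

With a guard of this kind, a sequential scatter such as the Seq_Stride/Seq_General pair reported above would skip the GPU buffer setup rather than being read through the layout of the parallel struct.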
