Oh. You're opening a can of worms but maybe that's your intent ;) I see the block Jacobi preconditioner in the valgrind logs.
Do,

mpirun -n 1 (or 2) ./ex7 -mat_type mpiaijcusparse -vec_type mpicusp -pc_type none

From here, we can try to sort out the VecScatterInitializeForGPU problem when mpirun/mpiexec is not used.

If you want to implement a block Jacobi preconditioner on multiple GPUs, that's a larger problem to solve. I had some code that sort of worked. We'd have to sit down and discuss.

-Paul

On Wed, Jan 22, 2014 at 10:48 AM, Dominic Meiser <[email protected]> wrote:

> Attached are the logs with 1 rank and 2 ranks. As far as I can tell
> these are different errors.
>
> For the log attached to the previous email I chose to run ex7 without
> mpirun so that valgrind checks ex7 and not mpirun. Is there a way to have
> valgrind check the MPI processes rather than mpirun?
>
> Cheers,
> Dominic
>
> On 01/22/2014 10:37 AM, Paul Mullowney wrote:
>
> Hmmm. I may not have protected against the case where the
> mpiaijcusp(arse) classes are called but without mpirun/mpiexec. I suppose it
> should have occurred to me that someone would do this.
>
> Try:
> mpirun -n 1 ./ex7 -mat_type mpiaijcusparse -vec_type cusp
>
> In this scenario, the sequential to sequential vecscatters should be
> called.
>
> Then,
> mpirun -n 2 ./ex7 -mat_type mpiaijcusparse -vec_type cusp
>
> In this scenario, MPI_General vecscatters should be called ... and work
> correctly if you have a system with multiple GPUs.
>
> -Paul
>
> On Wed, Jan 22, 2014 at 10:32 AM, Dominic Meiser <[email protected]> wrote:
>
>> Hey Paul,
>>
>> Thanks for providing background on this.
>>
>> On Wed 22 Jan 2014 10:05:13 AM MST, Paul Mullowney wrote:
>>
>>> Dominic,
>>> A few years ago, I was trying to minimize the amount of data transfer
>>> to and from the GPU (for multi-GPU MatMult) by inspecting the indices
>>> of the data that needed to be messaged to and from the device.
>>> Then, I
>>> would call gather kernels on the GPU which pulled the scattered data
>>> into contiguous buffers, which were then transferred to the host
>>> asynchronously (while the MatMult was occurring).
>>> VecScatterInitializeForGPU was added to build the necessary
>>> buffers as needed; that was the motivation behind it.
>>> An alternative approach is to message the smallest contiguous buffer
>>> containing all the data with a single cudaMemcpyAsync. This is the
>>> method currently implemented.
>>> I never found a case where the former implementation (with a GPU
>>> gather kernel) performed better than the alternative approach which
>>> messaged the smallest contiguous buffer. I looked at many, many matrices.
>>> Now, as far as I understand the VecScatter kernels, this method should
>>> only get called if the transfer is MPI_General (i.e. PtoP, parallel to
>>> parallel). Other VecScatter methods are called in other circumstances
>>> where the scatter is not MPI_General. That assumption could be
>>> wrong though.
>>>
>>
>> I see. I figured there was some logic in place to make sure that this
>> function only gets called in cases where the transfer type is MPI_General.
>> I'm getting segfaults in this function where the todata and fromdata are of
>> a different type. This could easily be user error but I'm not sure.
>> Here is
>> an example valgrind error:
>>
>> ==27781== Invalid read of size 8
>> ==27781==    at 0x1188080: VecScatterInitializeForGPU (vscatcusp.c:46)
>> ==27781==    by 0xEEAE5D: MatMult_MPIAIJCUSPARSE(_p_Mat*, _p_Vec*, _p_Vec*) (mpiaijcusparse.cu:108)
>> ==27781==    by 0xA20CC3: MatMult (matrix.c:2242)
>> ==27781==    by 0x4645E4: main (ex7.c:93)
>> ==27781==  Address 0x286305e0 is 1,616 bytes inside a block of size 1,620 alloc'd
>> ==27781==    at 0x4C26548: memalign (vg_replace_malloc.c:727)
>> ==27781==    by 0x4654F9: PetscMallocAlign(unsigned long, int, char const*, char const*, void**) (mal.c:27)
>> ==27781==    by 0xCAEECC: PetscTrMallocDefault(unsigned long, int, char const*, char const*, void**) (mtr.c:186)
>> ==27781==    by 0x5A5296: VecScatterCreate (vscat.c:1168)
>> ==27781==    by 0x9AF3C5: MatSetUpMultiply_MPIAIJ (mmaij.c:116)
>> ==27781==    by 0x96F0F0: MatAssemblyEnd_MPIAIJ(_p_Mat*, MatAssemblyType) (mpiaij.c:706)
>> ==27781==    by 0xA45358: MatAssemblyEnd (matrix.c:4959)
>> ==27781==    by 0x464301: main (ex7.c:78)
>>
>> This was produced by src/ksp/ksp/tutorials/ex7.c. The command line
>> options are
>>
>> ./ex7 -mat_type mpiaijcusparse -vec_type cusp
>>
>> In this particular case the todata is of type VecScatter_Seq_Stride and
>> fromdata is of type VecScatter_Seq_General. The complete valgrind log
>> (including configure options for petsc) is attached.
>>
>> Any comments or suggestions are appreciated.
>> Cheers,
>> Dominic
>>
>>> -Paul
>>>
>>> On Wed, Jan 22, 2014 at 9:49 AM, Dominic Meiser <[email protected]> wrote:
>>>
>>>     Hi,
>>>
>>>     I'm trying to understand VecScatterInitializeForGPU in
>>>     src/vec/vec/utils/veccusp/vscatcusp.c. I don't understand why
>>>     this function can get away with casting the fromdata and todata in
>>>     the inctx to VecScatter_MPI_General. Don't we need to inspect the
>>>     VecScatterType fields of the todata and fromdata?
>>>
>>>     Cheers,
>>>     Dominic
>>>
>>>     --
>>>     Dominic Meiser
>>>     Tech-X Corporation
>>>     5621 Arapahoe Avenue
>>>     Boulder, CO 80303
>>>     USA
>>>     Telephone: 303-996-2036
>>>     Fax: 303-448-7756
>>>     www.txcorp.com
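
On the original question (whether the VecScatterType of todata and fromdata should be inspected before the cast): the invalid read in the valgrind report is consistent with a smaller scatter context being read through a pointer to a larger struct. Below is a minimal sketch of a guard, in plain C. All type and function names here (ScatterKind, SeqStride, MPIGeneral, as_mpi_general, initialize_for_gpu) are hypothetical stand-ins and not the PETSc definitions. The sketch also assumes that each context stores its own type tag as its first field, which the thread does not confirm.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; NOT the PETSc structs. */
typedef enum { SCATTER_SEQ_GENERAL, SCATTER_SEQ_STRIDE, SCATTER_MPI_GENERAL } ScatterKind;

/* Assumption: every context begins with a tag naming its own type. */
typedef struct { ScatterKind kind; int n, first, step; } SeqStride;
typedef struct { ScatterKind kind; int n; int *indices; } MPIGeneral;

/* Check the tag before casting. Returns 0 and sets *out on a match,
   returns 1 and sets *out to NULL otherwise. */
static int as_mpi_general(void *data, MPIGeneral **out)
{
  /* A pointer to a struct may be converted to a pointer to its first member. */
  const ScatterKind *kind = (const ScatterKind *)data;

  *out = NULL;
  if (!kind || *kind != SCATTER_MPI_GENERAL) return 1;
  *out = (MPIGeneral *)data;
  return 0;
}

/* Hypothetical GPU setup: returns the number of index entries that would
   need device buffers, or 0 when either side is not MPI_General. */
static int initialize_for_gpu(void *todata, void *fromdata)
{
  MPIGeneral *to, *from;

  if (as_mpi_general(todata, &to) || as_mpi_general(fromdata, &from)) return 0;
  return to->n + from->n;
}
```

With a SeqStride context passed as todata, initialize_for_gpu returns 0 without reading any MPIGeneral field; with two MPIGeneral contexts holding 3 and 2 indices it returns 5.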

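
The two transfer strategies Paul compares (a gather into a packed buffer versus sending the smallest contiguous range that covers every needed index) can be sketched on the host in plain C. This is only an illustration under stated assumptions: memcpy stands in for the single cudaMemcpyAsync, the loop in pack_gather stands in for a GPU gather kernel, and all function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Find the smallest range [*lo, *hi] that covers every index.
   Returns the range length, or 0 when there are no indices. */
static size_t covering_range(const int *idx, size_t n, int *lo, int *hi)
{
  size_t i;

  if (n == 0) return 0;
  *lo = *hi = idx[0];
  for (i = 1; i < n; i++) {
    if (idx[i] < *lo) *lo = idx[i];
    if (idx[i] > *hi) *hi = idx[i];
  }
  return (size_t)(*hi - *lo) + 1;
}

/* Strategy A: gather the scattered entries into a packed buffer
   (host stand-in for a GPU gather kernel). Returns values moved. */
static size_t pack_gather(const double *x, const int *idx, size_t n, double *buf)
{
  size_t i;

  for (i = 0; i < n; i++) buf[i] = x[idx[i]];
  return n;
}

/* Strategy B: one block copy of the covering range
   (memcpy stands in for a single cudaMemcpyAsync). Returns values moved
   and stores the range start in *lo so the receiver can index buf[i - *lo]. */
static size_t pack_contiguous(const double *x, const int *idx, size_t n,
                              double *buf, int *lo)
{
  int hi;
  size_t len = covering_range(idx, n, lo, &hi);

  if (len) memcpy(buf, x + *lo, len * sizeof(double));
  return len;
}
```

For x = {0, 10, 20, 30, 40, 50, 60, 70} and indices {5, 2, 6}, pack_gather moves 3 values while pack_contiguous moves the 5 values x[2] through x[6], and the entry for index i is then found at buf[i - lo]. The contiguous copy moves more data but needs no per-index work; Paul reports that the gather variant never performed better in his measurements.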