Yeah, I agree. Once this is working, I'll go back and split MatMultAdd, etc.
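Roughly, I picture the split looking like the sketch below (hypothetical names and placeholder bodies, not the actual PETSc source): one small routine per case, so the yy==NULL/zero-size logic disappears.

    #include <petscmat.h>

    /* Sketch: yy = A*xx, no add. */
    static PetscErrorCode MatMult_Sketch(Mat A,Vec xx,Vec yy)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = VecZeroEntries(yy);CHKERRQ(ierr);
      /* ... launch the SpMV kernel writing directly into yy ... */
      PetscFunctionReturn(0);
    }

    /* Sketch: zz = yy + A*xx, plain (uncompressed) row storage. */
    static PetscErrorCode MatMultAdd_Sketch(Mat A,Vec xx,Vec yy,Vec zz)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      if (yy != zz) {ierr = VecCopy(yy,zz);CHKERRQ(ierr);}
      /* ... launch the SpMV kernel accumulating into zz ... */
      PetscFunctionReturn(0);
    }

    /* Sketch: zz = yy + A*xx, compressed row storage: the kernel only
       produces entries for the nonempty rows, so it writes a short work
       vector that is then added into zz at the compressed row indices. */
    static PetscErrorCode MatMultAdd_CompressedRow_Sketch(Mat A,Vec xx,Vec yy,Vec zz)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      if (yy != zz) {ierr = VecCopy(yy,zz);CHKERRQ(ierr);}
      /* ... SpMV into workVector, then scatter-add workVector into zz ... */
      PetscFunctionReturn(0);
    }

The duplication between the two add variants is the price for keeping each code path simple enough to verify on its own.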
On Wed, Jul 10, 2019 at 11:16 AM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
>    In the long run I would like to see smaller specialized chunks of code
> (with a bit of duplication between them) instead of highly overloaded
> routines like MatMultAdd_AIJCUSPARSE. Better three routines: one for
> multiply alone, one for multiply-add alone, and one for multiply-add with
> the compressed sparse format. Trying to get all the cases right
> (performance and correctness) for everything at once is unnecessary and
> risky. Having possibly zero-size objects (and hence null pointers) doesn't
> help the complex logic.
>
>    Barry
>
> On Jul 10, 2019, at 10:06 AM, Mark Adams <mfad...@lbl.gov> wrote:
> >
> > Thanks, you made several changes here, including switches with the
> > workvector size. I guess I should import this logic to the transpose
> > method(s), except for the yy==NULL branches ...
> >
> > MatMult_ calls MatMultAdd with yy=0, but the transpose versions have
> > their own code. MatMultTranspose_SeqAIJCUSPARSE is very simple.
> >
> > Thanks again,
> > Mark
> >
> > On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
> > Mark,
> >
> > if the difference is on lvec, I suspect the bug has to do with
> > compressed row storage. I have fixed a similar bug in MatMult. You want
> > to check cusparsestruct->workVector->size() against A->cmap->n.
> >
> > Stefano
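For concreteness, here is how I read Stefano's check (a sketch only: the wrapper function and the worksize parameter are hypothetical, standing in for cusparsestruct->workVector->size(); the private header is needed because A->cmap is internal):

    #include <petsc/private/matimpl.h>

    /* Sketch of the compressed-row guard Stefano describes. */
    static PetscErrorCode CompressedRowGuard_Sketch(Mat A,PetscInt worksize)
    {
      PetscFunctionBegin;
      if (worksize == A->cmap->n) {
        /* work vector is full length: its entries map 1:1 onto the result */
      } else {
        /* compressed rows: the kernel produced entries only for the
           nonempty rows; they must be scattered to their row indices,
           not copied contiguously */
      }
      PetscFunctionReturn(0);
    }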
> > On Wed, 10 Jul 2019 at 15:54, Mark Adams via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >
> > On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >
> >    ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
> >    if (nt != A->rmap->n) SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) and xx (%D)",A->rmap->n,nt);
> >    ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
> >    ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
> >
> > So the xx on the GPU appears ok?
> >
> > The norm is correct and ...
> >
> > The a->B appears ok?
> >
> > yes
> >
> > But on process 1 the result a->lvec is wrong?
> >
> > yes
> >
> > How do you look at the a->lvec? Do you copy it to the CPU and print it?
> >
> > I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
> > should copy it. Maybe I should make a CUDA version of these methods?
> >
> >    ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
> >    ierr = VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >    ierr = VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >    ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
> >
> > Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?
> >
> > This is where I have been digging around and printing stuff.
> >
> > Are you sure the problem isn't related to the "stream business"?
> >
> > I don't know what that is, but I have played around with adding
> > cudaDeviceSynchronize.
> >
> >    /* This multiplication sequence is different sequence
> >       than the CPU version. In particular, the diagonal block
> >       multiplication kernel is launched in one stream. Then,
> >       in a separate stream, the data transfers from DeviceToHost
> >       (with MPI messaging in between), then HostToDevice are
> >       launched. Once the data transfer stream is synchronized,
> >       to ensure messaging is complete, the MatMultAdd kernel
> >       is launched in the original (MatMult) stream to protect
> >       against race conditions.
> >
> >       This sequence should only be called for GPU computation. */
> >
> > Note this comment isn't right and appears to be cut and paste from
> > somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
> >
> > Yes, I noticed this. Same as MatMult and not correct here.
> >
> > Any way to "turn off the stream business" and see if the result is then
> > correct?
> >
> > How do you do that? I'm looking at docs on streams but not sure how it's
> > used here.
> >
> > Perhaps the stream business was done correctly for MatMult() but was
> > never right for MatMultTranspose()?
> >
> >    Barry
> >
> > BTW: Unrelated comment, the code
> >
> >    ierr = VecSet(yy,0);CHKERRQ(ierr);
> >    ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
> >
> > has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
> > VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
> > them all yourself, so setting them to zero before calling
> > VecCUDAGetArrayWrite() does nothing except waste time.
> >
> > OK, I'll get rid of it.
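That is, the corrected write path reduces to just the Get/Restore pair around the kernel, as in this sketch (the wrapper function is hypothetical, and depending on the PETSc version the CUDA Vec calls may be declared in petsccuda.h rather than petscvec.h):

    #include <petscvec.h>

    /* Sketch of Barry's point: no VecSet(yy,0) is needed, because
       VecCUDAGetArrayWrite() hands back an array whose previous
       contents are to be ignored and fully overwritten. */
    static PetscErrorCode FillResult_Sketch(Vec yy)
    {
      PetscErrorCode ierr;
      PetscScalar    *yarray;

      PetscFunctionBegin;
      ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr);
      /* ... kernel writes every entry of yarray ... */
      ierr = VecCUDARestoreArrayWrite(yy,&yarray);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }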
> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >
> > > I am stumped with this GPU bug(s). Maybe someone has an idea.
> > >
> > > I did find a bug in the cuda transpose mat-vec that cuda-memcheck
> > > detected, but I still have differences between the GPU and CPU
> > > transpose mat-vec. I've got it down to a very simple test: bicg/none
> > > on a tiny mesh with two processors. It works on one processor or with
> > > cg/none. So it is the transpose mat-vec.
> > >
> > > I see that the result of the off-diagonal (a->lvec) is different only
> > > on proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms
> > > of mat and vec and printed out matlab vectors. Below is the CPU output
> > > and then the GPU with a view of the scatter object, which is identical
> > > as you can see.
> > >
> > > The matlab B matrix and xx vector are identical. Maybe the GPU copy is
> > > wrong ...
> > >
> > > The only/first difference between CPU and GPU is a->lvec (the
> > > off-diagonal contribution) on processor 1. (You can see the norms are
> > > different.) Here is the diff on the process 1 a->lvec vector (all
> > > values are off).
> > >
> > > Any thoughts would be appreciated,
> > > Mark
> > >
> > > 15:30 1 /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
> > > 2,12c2,12
> > > < % type: seqcuda
> > > < Vec_0x53738630_0 = [
> > > < 9.5702137431412879e+00
> > > < 2.1970298791152253e+01
> > > < 4.5422290209190646e+00
> > > < 2.0185031807270226e+00
> > > < 4.2627312508573375e+01
> > > < 1.0889191983882025e+01
> > > < 1.6038202417695462e+01
> > > < 2.7155672033607665e+01
> > > < 6.2540357853223556e+00
> > > ---
> > > > % type: seq
> > > > Vec_0x3a546440_0 = [
> > > > 4.5565851251714653e+00
> > > > 1.0460532998971189e+01
> > > > 2.1626531807270220e+00
> > > > 9.6105288923182408e-01
> > > > 2.0295782656035659e+01
> > > > 5.1845791066529463e+00
> > > > 7.6361340020576058e+00
> > > > 1.2929401011659799e+01
> > > > 2.9776812928669392e+00
> > >
> > > 15:15 130 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1
> > > [0] 27 global equations, 9 vertices
> > > [0] 27 equations in vector, 9 vertices
> > >   0 SNES Function norm 1.223958326481e+02
> > >     0 KSP Residual norm 1.223958326481e+02
> > > [0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|= 1.424708937136e+00
> > > [1] |x|= 1.223958326481e+02 |a->lvec|= 2.844171413778e+01 |B|= 1.424708937136e+00
> > > [1] 1) |yy|= 2.007423334680e+02
> > > [0] 1) |yy|= 2.007423334680e+02
> > > [0] 2) |yy|= 1.957605719265e+02
> > > [1] 2) |yy|= 1.957605719265e+02
> > > [1] Number sends = 1; Number to self = 0
> > > [1]   0 length = 9 to whom 0
> > > Now the indices for all remote sends (in order by process sent to)
> > > [1] 9
> > > [1] 10
> > > [1] 11
> > > [1] 12
> > > [1] 13
> > > [1] 14
> > > [1] 15
> > > [1] 16
> > > [1] 17
> > > [1] Number receives = 1; Number from self = 0
> > > [1] 0 length 9 from whom 0
> > > Now the indices for all remote receives (in order by process received from)
> > > [1] 0
> > > [1] 1
> > > [1] 2
> > > [1] 3
> > > [1] 4
> > > [1] 5
> > > [1] 6
> > > [1] 7
> > > [1] 8
> > >     1 KSP Residual norm 8.199932342150e+01
> > > Linear solve did not converge due to DIVERGED_ITS iterations 1
> > > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
> > >
> > > 15:19 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1 -ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda
> > > [0] 27 global equations, 9 vertices
> > > [0] 27 equations in vector, 9 vertices
> > >   0 SNES Function norm 1.223958326481e+02
> > >     0 KSP Residual norm 1.223958326481e+02
> > > [0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|= 1.424708937136e+00
> > > [1] |x|= 1.223958326481e+02 |a->lvec|= 5.973624458725e+01 |B|= 1.424708937136e+00
> > > [0] 1) |yy|= 2.007423334680e+02
> > > [1] 1) |yy|= 2.007423334680e+02
> > > [0] 2) |yy|= 1.953571867633e+02
> > > [1] 2) |yy|= 1.953571867633e+02
> > > [1] Number sends = 1; Number to self = 0
> > > [1]   0 length = 9 to whom 0
> > > Now the indices for all remote sends (in order by process sent to)
> > > [1] 9
> > > [1] 10
> > > [1] 11
> > > [1] 12
> > > [1] 13
> > > [1] 14
> > > [1] 15
> > > [1] 16
> > > [1] 17
> > > [1] Number receives = 1; Number from self = 0
> > > [1] 0 length 9 from whom 0
> > > Now the indices for all remote receives (in order by process received from)
> > > [1] 0
> > > [1] 1
> > > [1] 2
> > > [1] 3
> > > [1] 4
> > > [1] 5
> > > [1] 6
> > > [1] 7
> > > [1] 8
> > >     1 KSP Residual norm 8.199932342150e+01
> >
> > --
> > Stefano