I am stumped by this GPU bug (or bugs); maybe someone has an idea. I did find a bug in the CUDA transpose mat-vec that cuda-memcheck detected, but I still have differences between the GPU and CPU transpose mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh with two processors. It works on one processor or with cg/none, so the problem is in the transpose mat-vec.
I see that the result of the off-diagonal contribution (a->lvec) is different *only on proc 1*. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of the mat and vec and printed out MATLAB vectors. Below is the CPU output and then the GPU output, with a view of the scatter object, which is identical as you can see. The MATLAB B matrix and xx vector are identical. Maybe the GPU copy is wrong ... The only/first difference between CPU and GPU is a->lvec (the off-diagonal contribution) on processor 1 (you can see the norms are *different*). Here is the diff on the process 1 a->lvec vector (all values are off). Any thoughts would be appreciated,

Mark

15:30 1 /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
2,12c2,12
< % type: seqcuda
< Vec_0x53738630_0 = [
< 9.5702137431412879e+00
< 2.1970298791152253e+01
< 4.5422290209190646e+00
< 2.0185031807270226e+00
< 4.2627312508573375e+01
< 1.0889191983882025e+01
< 1.6038202417695462e+01
< 2.7155672033607665e+01
< 6.2540357853223556e+00
---
> % type: seq
> Vec_0x3a546440_0 = [
> 4.5565851251714653e+00
> 1.0460532998971189e+01
> 2.1626531807270220e+00
> 9.6105288923182408e-01
> 2.0295782656035659e+01
> 5.1845791066529463e+00
> 7.6361340020576058e+00
> 1.2929401011659799e+01
> 2.9776812928669392e+00

15:15 130 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1
[0] 27 global equations, 9 vertices
[0] 27 equations in vector, 9 vertices
  0 SNES Function norm 1.223958326481e+02
    0 KSP Residual norm 1.223958326481e+02
[0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|= 1.424708937136e+00
[1] |x|= 1.223958326481e+02 |a->lvec|= *2.844171413778e+01* |B|= 1.424708937136e+00
[1] 1) |yy|= 2.007423334680e+02
[0] 1) |yy|= 2.007423334680e+02
[0] 2) |yy|= 1.957605719265e+02
[1] 2) |yy|= 1.957605719265e+02
[1] Number sends = 1; Number to self = 0
[1]   0 length = 9 to whom 0
Now the indices for all remote sends (in order by process sent to)
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] Number receives = 1; Number from self = 0
[1]   0 length 9 from whom 0
Now the indices for all remote receives (in order by process received from)
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
    1 KSP Residual norm 8.199932342150e+01
Linear solve did not converge due to DIVERGED_ITS iterations 1
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0

15:19 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56 -cells 2,2,1 *-ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda*
[0] 27 global equations, 9 vertices
[0] 27 equations in vector, 9 vertices
  0 SNES Function norm 1.223958326481e+02
    0 KSP Residual norm 1.223958326481e+02
[0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|= 1.424708937136e+00
[1] |x|= 1.223958326481e+02 |a->lvec|= *5.973624458725e+01* |B|= 1.424708937136e+00
[0] 1) |yy|= 2.007423334680e+02
[1] 1) |yy|= 2.007423334680e+02
[0] 2) |yy|= 1.953571867633e+02
[1] 2) |yy|= 1.953571867633e+02
[1] Number sends = 1; Number to self = 0
[1]   0 length = 9 to whom 0
Now the indices for all remote sends (in order by process sent to)
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] Number receives = 1; Number from self = 0
[1]   0 length 9 from whom 0
Now the indices for all remote receives (in order by process received from)
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
    1 KSP Residual norm 8.199932342150e+01
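For anyone following along, here is a minimal NumPy sketch (not PETSc code, values made up for illustration) of the quantity the instrumentation above checks: in the transpose mat-vec each process computes its off-diagonal contribution a->lvec = B^T * x_local from its local off-diagonal block B, then scatter-adds it to the owners. The point is that identical B and x must give an identical lvec norm, so a mismatch like the one above implicates the device kernel or a stale device copy, not the scatter.

```python
import numpy as np

# Stand-ins for the local off-diagonal block B and the local part of
# the input vector x on process 1 (sizes match the 9-entry lvec above;
# the actual values are hypothetical).
rng = np.random.default_rng(0)
B = rng.random((9, 9))
x = rng.random(9)

# The off-diagonal contribution (what PETSc stores in a->lvec) before
# it is scatter-added to the owning process.
lvec = B.T @ x

# If B and x are bitwise identical on CPU and GPU, this norm must agree
# to rounding error between the two runs.
print(np.linalg.norm(lvec))
```

Since |B| and |x| agree between the two runs above while |a->lvec| does not, the B^T * x step itself (or the copy of B/x to the device) is the natural suspect.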