I previously thought that a custom Red-Black GS solution would be too specific 
for PETSc, as it would only work for star-shaped stencils of width 1.

For that specific operation, a custom kernel could probably achieve 
significantly better performance, since it would not have to load a mask from 
memory and could exploit the very regular structure of the problem. We still 
wanted to use cuSPARSE here because it seemed more in line with PETSc's present 
GPU support, and using custom kernels in one library might be seen as unfair 
when comparing it to others.
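
To make the idea concrete, here is a rough sketch of what such a kernel could 
look like for the 7-point (width-1 star) stencil of a 3D Poisson problem; the 
kernel name, its arguments, and the exact stencil scaling are illustrative 
assumptions rather than code from our benchmarks:

    // One half-sweep of red-black Gauss-Seidel for -Laplace(u) = f with the
    // standard 7-point stencil on an nx*ny*nz grid. color is 0 (red) or 1
    // (black); Dirichlet boundary points are skipped. ihx2 = 1/hx^2, etc.
    __global__ void rbgs_sweep(double *u, const double *f,
                               double ihx2, double ihy2, double ihz2,
                               int nx, int ny, int nz, int color)
    {
      const int i = blockIdx.x * blockDim.x + threadIdx.x;
      const int j = blockIdx.y * blockDim.y + threadIdx.y;
      const int k = blockIdx.z * blockDim.z + threadIdx.z;
      if (i < 1 || j < 1 || k < 1 || i > nx - 2 || j > ny - 2 || k > nz - 2) return;
      if (((i + j + k) & 1) != color) return;        // update only this color

      const long   id   = ((long)k * ny + j) * nx + i;  // row-major index
      const double diag = 2.0 * (ihx2 + ihy2 + ihz2);
      const double nb   =
          ihx2 * (u[id - 1]             + u[id + 1]) +
          ihy2 * (u[id - nx]            + u[id + nx]) +
          ihz2 * (u[id - (long)nx * ny] + u[id + (long)nx * ny]);
      // Gauss-Seidel update: every neighbor has the opposite color, so its
      // value is already the latest one from the previous half-sweep.
      u[id] = (f[id] + nb) / diag;
    }

A full smoothing step would launch this twice, first with color = 0 and then 
with color = 1 (plus whatever boundary handling the real code needs).
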

But if you are interested anyway, I will share my implementation if/when I get 
around to writing it.

Best,
Paul Grosse-Bley

On Wednesday, March 01, 2023 16:51 CET, Barry Smith <[email protected]> wrote:
   On Mar 1, 2023, at 10:30 AM, Paul Grosse-Bley 
<[email protected]> wrote:

Thank you for the detailed answer, Barry. I had hit a dead end on my side.

If you wish to compare, for example, ex45.c with a code that does not 
incorporate the Dirichlet boundary nodes in the linear system, you can just use 
0 boundary conditions for both codes.

Do you mean to implement the boundary conditions explicitly in e.g. hpgmg-cuda 
instead of using the ghosted cells for them?

   I don't know anything about hpgmg-cuda and what it means by "ghosted 
cells". I am just saying that I think it is reasonable to use the style of 
ex45.c, which retains the Dirichlet unknowns in the global matrix (with zero at 
those points), in PETSc to compare with other codes that may or may not do 
something different. But you need to use zero for the Dirichlet points to 
ensure that the "funny" convergence rates do not happen at the beginning, 
making the comparison unbalanced between the two codes.
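
For illustration, a minimal sketch (not the literal ex45.c code) of what 
retaining a Dirichlet point looks like inside a DMDA-based assembly loop; A, 
mx, my, mz and the loop indices i, j, k are assumed from the surrounding code, 
and the diagonal scaling is a free choice:

    /* Dirichlet boundary point: keep it in the global system as a
     * diagonal-only row; the matching right-hand-side entry is set to 0
     * so the solver keeps the boundary value at exactly zero. */
    if (i == 0 || j == 0 || k == 0 || i == mx - 1 || j == my - 1 || k == mz - 1) {
      MatStencil  row  = {.i = i, .j = j, .k = k, .c = 0};
      PetscScalar diag = 1.0;   /* scaling is a choice; it does not change the solution */
      PetscCall(MatSetValuesStencil(A, 1, &row, 1, &row, &diag, INSERT_VALUES));
    } else {
      /* ... usual 7-point interior stencil row ... */
    }
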
Am I right in assuming that the PCMG coarsening (using the DMDA's geometric 
information) will cause the boundary condition on the coarser grids to be 
finite (>0)?

   In general, even if the Dirichlet points on the fine grid are zero, during 
PCMG with DMDA as in ex45.c those "boundary" values on the coarser grids may 
indeed end up non-zero during the iterative process. But this is "harmless", 
just part of the algorithm; it does mean, however, that the convergence of a 
code that does not include those points will be different (not necessarily 
better or worse, just different).
Ideally I would like to just use some kind of GPU-parallel (colored) 
SOR/Gauss-Seidel instead of Jacobi. One can relatively easily implement 
Red-Black GS using cuSPARSE's masked matrix-vector products, but I have not 
found any information on implementing a custom preconditioner in PETSc.

   You can use https://petsc.org/release/docs/manualpages/PC/PCSHELL/ and you 
can look at src/ksp/pc/impls/jacobi.c for detailed comments on what goes into a 
preconditioner object. If you write such a GPU-parallel (colored) 
SOR/Gauss-Seidel, we would love to include it in PETSc. Note also 
https://petsc.org/release/docs/manualpages/KSP/KSPCHEBYSHEV/; it and 
potentially other "polynomial preconditioners" are an alternative approach for 
having more powerful parallel smoothers.
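
As a rough sketch of the PCSHELL route (MyRBGSApply and UseRBGSSmoother are 
hypothetical user routines, not existing PETSc functions; the PETSc calls 
themselves are the documented shell/multigrid API):

    #include <petscksp.h>

    /* Hypothetical apply callback: y = M^{-1} x, where applying M^{-1} means
     * running one or a few red-black Gauss-Seidel sweeps on the GPU. The
     * sweeps themselves (and the context holding grid sizes / device
     * pointers) are user code. */
    static PetscErrorCode MyRBGSApply(PC pc, Vec x, Vec y)
    {
      void *ctx;

      PetscFunctionBeginUser;
      PetscCall(PCShellGetContext(pc, &ctx));
      /* ... read x, run the red then the black sweep, write the result to y ... */
      PetscFunctionReturn(0);
    }

    /* Install the shell PC as the smoother on every fine PCMG level;
     * level 0 is the coarse-grid solve and is left alone here. */
    static PetscErrorCode UseRBGSSmoother(PC pcmg, PetscInt nlevels, void *ctx)
    {
      PetscFunctionBeginUser;
      for (PetscInt l = 1; l < nlevels; l++) {
        KSP smoother;
        PC  shell;

        PetscCall(PCMGGetSmoother(pcmg, l, &smoother));
        PetscCall(KSPGetPC(smoother, &shell));
        PetscCall(PCSetType(shell, PCSHELL));
        PetscCall(PCShellSetContext(shell, ctx));
        PetscCall(PCShellSetApply(shell, MyRBGSApply));
      }
      PetscFunctionReturn(0);
    }

This would be called after the PCMG levels exist (e.g. after KSPSetUp()), and 
would typically be paired with -mg_levels_ksp_type richardson so the shell PC 
is applied as a stationary smoother on each level.
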
Best,
Paul Grosse-Bley

On Wednesday, March 01, 2023 05:38 CET, Barry Smith <[email protected]> wrote:
   Ok, here is the situation. The command line options as given do not result 
in multigrid-quality convergence in any of the runs; the error contraction 
factor is around .94 (meaning that, for the modes the multigrid algorithm does 
the worst on, it only removes about 6 percent of them per iteration).

But this is hidden by the initial right-hand side for the linear system as 
written in ex45.c, which has O(h) values on the boundary nodes and O(h^3) 
values on the interior nodes. The first iterations largely work on the boundary 
residual and make great progress attacking it, so that it looks like one has a 
good error contraction factor. One then sees the error contraction factor get 
worse and worse in the later iterations. With 0 on the boundary, the iterations 
quickly reach the bad regime where the error contraction factor is near one. 
One can see this by using -ksp_rtol 1.e-12 and having the MG code print the 
residual decrease for each iteration. Though the 0 boundary condition case 
appears to converge much more slowly (since it requires many more iterations), 
if you factor out the huge advantage the nonzero boundary condition case has at 
the beginning (in terms of decreasing the residual), you see that both have an 
asymptotic error contraction factor of around .94 (which is horrible for 
multigrid).

   I now add -mg_levels_ksp_richardson_scale .9 -mg_coarse_ksp_richardson_scale 
.9 and rerun the two cases (nonzero and zero boundary right-hand side); they 
take 35 and 42 iterations (much better):
initial residual norm 14.6993
next residual norm 0.84167 0.0572591
next residual norm 0.0665392 0.00452668
next residual norm 0.0307273 0.00209039
next residual norm 0.0158949 0.00108134
next residual norm 0.00825189 0.000561378
next residual norm 0.00428474 0.000291492
next residual norm 0.00222482 0.000151355
next residual norm 0.00115522 7.85898e-05
next residual norm 0.000599836 4.0807e-05
next residual norm 0.000311459 2.11887e-05
next residual norm 0.000161722 1.1002e-05
next residual norm 8.39727e-05 5.71269e-06
next residual norm 4.3602e-05 2.96626e-06
next residual norm 2.26399e-05 1.5402e-06
next residual norm 1.17556e-05 7.99735e-07
next residual norm 6.10397e-06 4.15255e-07
next residual norm 3.16943e-06 2.15617e-07
next residual norm 1.64569e-06 1.11957e-07
next residual norm 8.54511e-07 5.81326e-08
next residual norm 4.43697e-07 3.01848e-08
next residual norm 2.30385e-07 1.56732e-08
next residual norm 1.19625e-07 8.13815e-09
next residual norm 6.21143e-08 4.22566e-09
next residual norm 3.22523e-08 2.19413e-09
next residual norm 1.67467e-08 1.13928e-09
next residual norm 8.69555e-09 5.91561e-10
next residual norm 4.51508e-09 3.07162e-10
next residual norm 2.34441e-09 1.59491e-10
next residual norm 1.21731e-09 8.28143e-11
next residual norm 6.32079e-10 4.30005e-11
next residual norm 3.28201e-10 2.23276e-11
next residual norm 1.70415e-10 1.15934e-11
next residual norm 8.84865e-11 6.01976e-12
next residual norm 4.59457e-11 3.1257e-12
next residual norm 2.38569e-11 1.62299e-12
next residual norm 1.23875e-11 8.42724e-13
Linear solve converged due to CONVERGED_RTOL iterations 35
Residual norm 1.23875e-11

initial residual norm 172.601
next residual norm 154.803 0.896887
next residual norm 66.9409 0.387837
next residual norm 34.4572 0.199636
next residual norm 17.8836 0.103612
next residual norm 9.28582 0.0537995
next residual norm 4.82161 0.027935
next residual norm 2.50358 0.014505
next residual norm 1.29996 0.0075316
next residual norm 0.674992 0.00391071
next residual norm 0.350483 0.0020306
next residual norm 0.181985 0.00105437
next residual norm 0.094494 0.000547472
next residual norm 0.0490651 0.000284269
next residual norm 0.0254766 0.000147604
next residual norm 0.0132285 7.6642e-05
next residual norm 0.00686876 3.97956e-05
next residual norm 0.00356654 2.06635e-05
next residual norm 0.00185189 1.07293e-05
next residual norm 0.000961576 5.5711e-06
next residual norm 0.000499289 2.89274e-06
next residual norm 0.000259251 1.50203e-06
next residual norm 0.000134614 7.79914e-07
next residual norm 6.98969e-05 4.04963e-07
next residual norm 3.62933e-05 2.10273e-07
next residual norm 1.88449e-05 1.09182e-07
next residual norm 9.78505e-06 5.66919e-08
next residual norm 5.0808e-06 2.94367e-08
next residual norm 2.63815e-06 1.52847e-08
next residual norm 1.36984e-06 7.93645e-09
next residual norm 7.11275e-07 4.12093e-09
next residual norm 3.69322e-07 2.13975e-09
next residual norm 1.91767e-07 1.11105e-09
next residual norm 9.95733e-08 5.769e-10
next residual norm 5.17024e-08 2.99549e-10
next residual norm 2.6846e-08 1.55538e-10
next residual norm 1.39395e-08 8.07615e-11
next residual norm 7.23798e-09 4.19348e-11
next residual norm 3.75824e-09 2.17742e-11
next residual norm 1.95138e-09 1.13058e-11
next residual norm 1.01327e-09 5.87059e-12
next residual norm 5.26184e-10 3.04856e-12
next residual norm 2.73182e-10 1.58274e-12
next residual norm 1.41806e-10 8.21586e-13
Linear solve converged due to CONVERGED_RTOL iterations 42
Residual norm 1.41806e-10

Notice that in the first run the residual norm still dives much more quickly 
for the first 2 iterations than in the second run. This is because the first 
run has "lucky error" that gets wiped out easily by the big boundary term. 
After that you can see that the convergence of both is very similar, with both 
having a reasonable error contraction factor of .51. I've attached the modified 
src/ksp/pc/impls/mg/mg.c that prints the residuals along the way.
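
As a sanity check, the asymptotic contraction factor can be read off the tails 
of the two logs above (the ratios below are just the last two residual norms of 
each run):

  \rho \approx \frac{\lVert r_{k+1}\rVert}{\lVert r_k\rVert}
       \approx \frac{1.23875\times 10^{-11}}{2.38569\times 10^{-11}}
       \approx \frac{1.41806\times 10^{-10}}{2.73182\times 10^{-10}}
       \approx 0.52,
  \qquad
  n \approx \frac{\ln 10^{-12}}{\ln 0.52} \approx 42,

which is in line with the ~.51 figure above and, assuming the runs used 
-ksp_rtol 1.e-12, with the 42 iterations of the zero-boundary case.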
