Hi Pierre,

I’m using GAMG on a shifted Laplacian with these options:
-st_fieldsplit_pressure_ksp_type preonly
-st_fieldsplit_pressure_pc_composite_type additive
-st_fieldsplit_pressure_pc_type composite
-st_fieldsplit_pressure_sub_0_ksp_pc_type jacobi
-st_fieldsplit_pressure_sub_0_pc_type ksp
-st_fieldsplit_pressure_sub_1_ksp_pc_gamg_square_graph 10
-st_fieldsplit_pressure_sub_1_ksp_pc_type gamg
-st_fieldsplit_pressure_sub_1_pc_type ksp

and I end up with the following logs on 512 (top) and 2048 (bottom) processes:
MatMult          1577790 1.0 3.1967e+03 1.2 4.48e+12 1.6 7.6e+09 5.6e+03 0.0e+00  7 71 75 63  0   7 71 75 63  0 650501
MatMultAdd        204786 1.0 1.3412e+02 5.5 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00  0  0  5  0  0   0  0  5  0  0 50762
MatMultTranspose  204786 1.0 4.6790e+01 4.3 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00  0  0  5  0  0   0  0  5  0  0 145505
[..]
KSPSolve_FS_3       7286 1.0 7.5506e+02 1.0 9.14e+11 1.8 7.3e+09 1.5e+03 2.6e+05  2 14 71 16 34   2 14 71 16 34 539009

MatMult          1778795 1.0 3.5511e+03 4.1 1.46e+12 1.9 4.0e+10 2.4e+03 0.0e+00  7 66 75 61  0   7 66 75 61  0 728371
MatMultAdd        222360 1.0 2.5904e+03 48.0 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00 14  0  4  0  0  14  0  4  0  0  2872
MatMultTranspose  222360 1.0 1.8736e+03 421.8 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00  0  0  4  0  0   0  0  4  0  0  3970
[..]
KSPSolve_FS_3       7412 1.0 2.8939e+03 1.0 2.66e+11 2.1 3.5e+10 6.1e+02 2.7e+05 17 11 67 14 28  17 11 67 14 28 148175

MatMultAdd and MatMultTranspose (performed by GAMG) somehow ruin the scalability of the overall solver. The pressure space “only” has 3M unknowns so I’m guessing that’s why GAMG is having a hard time strong scaling.

3M unknowns divided by 512 processes implies fewer than 10k unknowns per process. It is not unusual to see strong scaling roll off at this size. Also note that the time per call(!) for "MatMult" is the same in both cases, indicating that you are running into a latency-limited regime.
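To make the arithmetic explicit, here is a small Python sketch that only re-derives these two observations from the numbers quoted in the logs above. The value 3_000_000 is the rounded "3M" from the text, the helper names are made up for illustration, and the MatMult counts cover all matrices in the run, not just the pressure block.

```python
# Figures copied from the MatMult rows of the two logs above:
# total (max over ranks) time in seconds and number of calls.
RUNS = {
    512: {"matmult_time": 3.1967e+03, "matmult_calls": 1577790},
    2048: {"matmult_time": 3.5511e+03, "matmult_calls": 1778795},
}

# "3M unknowns" in the pressure space; rounded, so treat results as estimates.
N_PRESSURE = 3_000_000


def unknowns_per_process(n_unknowns, n_procs):
    """Average number of unknowns owned by each process."""
    return n_unknowns / n_procs


def time_per_call(total_time, calls):
    """Average wall time of a single call, in seconds."""
    return total_time / calls


for n_procs, row in RUNS.items():
    per_proc = unknowns_per_process(N_PRESSURE, n_procs)
    per_call = time_per_call(row["matmult_time"], row["matmult_calls"])
    print(f"{n_procs:5d} processes: {per_proc:8.1f} unknowns/process, "
          f"{per_call:.3e} s per MatMult")
```

Both runs come out at roughly 2 ms per MatMult call, even though the second run uses four times as many processes.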

Also, have a look at the time ratios (maximum over minimum time across ranks): with 2048 processes, MatMultAdd and MatMultTranspose show ratios of 48 and 421, respectively. Maybe one of your MPI ranks is getting a huge workload?


For the other fields, the matrix is somehow distributed nicely, i.e., I don’t 
want to change the overall distribution of the matrix.
Do you have any suggestion to improve the performance of GAMG in that scenario? 
I had two ideas in mind but please correct me if I’m wrong or if this is not 
doable:
1) before setting up GAMG, first use a PCTELESCOPE to avoid having too many 
processes work on this small problem
2) have the sub_0_ and the sub_1_ work on two different nonoverlapping 
communicators of size PETSC_COMM_WORLD/2, do the solve concurrently, and then 
sum the solutions (only worth doing because of -pc_composite_type additive). I 
have no idea if this is easily doable with PETSc command line arguments.

1) is the more flexible approach, as you have better control over the system sizes after 'telescoping'.
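
As a rough illustration of that control, the Python sketch below estimates how many ranks stay active and how many pressure unknowns each of them would own for a few reduction factors, and prints the matching options. It assumes that PCTELESCOPE is selected with -pc_type telescope, that its factor is set with -pc_telescope_reduction_factor, and that the inner solver takes the telescope_ prefix; please check this against the PCTELESCOPE manual page for your PETSc version. The prefix is taken from your options above, the helper names are hypothetical, and the factors are arbitrary examples, not recommendations.

```python
# Prefix of the inner KSP that currently runs GAMG (from the options above).
PREFIX = "st_fieldsplit_pressure_sub_1_ksp_"

# "3M unknowns" in the pressure space; rounded, so treat results as estimates.
N_PRESSURE = 3_000_000


def telescoped_layout(n_unknowns, n_procs, reduction_factor):
    """Return (active ranks, average unknowns per active rank).

    Assumes the reduced communicator has n_procs / reduction_factor ranks.
    """
    active = n_procs // reduction_factor
    return active, n_unknowns / active


def telescope_options(prefix, reduction_factor):
    """Build option strings that wrap GAMG in PCTELESCOPE (assumed names)."""
    return [
        f"-{prefix}pc_type telescope",
        f"-{prefix}pc_telescope_reduction_factor {reduction_factor}",
        f"-{prefix}telescope_pc_type gamg",
    ]


for factor in (2, 4, 8, 16):
    active, per_rank = telescoped_layout(N_PRESSURE, 2048, factor)
    print(f"factor {factor:2d}: {active:4d} active ranks, "
          f"{per_rank:9.1f} unknowns per active rank")

print("\n".join(telescope_options(PREFIX, 4)))
```

With 2048 processes, a factor of 4 brings the pressure solve back to the 512-rank layout of your first run. Any GAMG options you already pass would presumably need the extra telescope_ prefix as well.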

Best regards,
Karli
