On Thu, Jul 26, 2018 at 11:15 AM, Fande Kong <fdkong...@gmail.com> wrote:
>
> On Thu, Jul 26, 2018 at 9:51 AM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
>
>> Hi, Pierre,
>>   From your log_view files, I see you did strong scaling. You used 4X
>> more cores, but the execution time only dropped from 3.9143e+04
>> to 1.6910e+04.
>>   From my previous analysis of a GAMG weak scaling test, it looks like
>> communication is one of the reasons that caused poor scaling. In your
>> case, VecScatterEnd time was doubled from 1.5575e+03 to 3.2413e+03. Its
>> time percent jumped from 1% to 17%. This time can contribute to the big
>> time ratio in MatMultAdd and MatMultTranspose, misleading you guys into
>> thinking there was load-imbalance computation-wise.
>>   The reason is that I found in the interpolation and restriction phases
>> of gamg, the communication pattern is very bad. A few processes communicate
>> with hundreds of neighbors with message sizes of a few bytes.
>
> We may need to truncate interpolation/restriction operators. Also do some
> aggressive coarsening. Unfortunately, GAMG currently does not support this.

Are these gamg options the truncation you thought of?

-pc_gamg_threshold[] <thresh,default=0> - Before aggregating the graph GAMG will remove small values from the graph on each level
-pc_gamg_threshold_scale <scale,default=1> - Scaling of threshold on each coarser grid if not specified

> Fande,
>
>> If we can avoid this pattern algorithmically (which I don't know), or
>> find ways with faster communication (which I am working on), then we can
>> get better scalability.
>>
>> --Junchao Zhang
>>
>> On Thu, Jul 26, 2018 at 10:02 AM, Pierre Jolivet <pierre.joli...@enseeiht.fr> wrote:
>>
>>> > On 26 Jul 2018, at 4:24 PM, Karl Rupp <r...@iue.tuwien.ac.at> wrote:
>>> >
>>> > Hi Pierre,
>>> >
>>> >> I’m using GAMG on a shifted Laplacian with these options:
>>> >> -st_fieldsplit_pressure_ksp_type preonly
>>> >> -st_fieldsplit_pressure_pc_composite_type additive
>>> >> -st_fieldsplit_pressure_pc_type composite
>>> >> -st_fieldsplit_pressure_sub_0_ksp_pc_type jacobi
>>> >> -st_fieldsplit_pressure_sub_0_pc_type ksp
>>> >> -st_fieldsplit_pressure_sub_1_ksp_pc_gamg_square_graph 10
>>> >> -st_fieldsplit_pressure_sub_1_ksp_pc_type gamg
>>> >> -st_fieldsplit_pressure_sub_1_pc_type ksp
>>> >> and I end up with the following logs on 512 (top) and 2048 (bottom) processes:
>>> >> MatMult          1577790 1.0 3.1967e+03 1.2 4.48e+12 1.6 7.6e+09 5.6e+03 0.0e+00 7 71 75 63 0 7 71 75 63 0 650501
>>> >> MatMultAdd       204786 1.0 1.3412e+02 5.5 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00 0 0 5 0 0 0 0 5 0 0 50762
>>> >> MatMultTranspose 204786 1.0 4.6790e+01 4.3 1.50e+10 1.7 5.5e+08 2.7e+02 0.0e+00 0 0 5 0 0 0 0 5 0 0 145505
>>> >> [..]
>>> >> KSPSolve_FS_3    7286 1.0 7.5506e+02 1.0 9.14e+11 1.8 7.3e+09 1.5e+03 2.6e+05 2 14 71 16 34 2 14 71 16 34 539009
>>> >> MatMult          1778795 1.0 3.5511e+03 4.1 1.46e+12 1.9 4.0e+10 2.4e+03 0.0e+00 7 66 75 61 0 7 66 75 61 0 728371
>>> >> MatMultAdd       222360 1.0 2.5904e+0348.0 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00 14 0 4 0 0 14 0 4 0 0 2872
>>> >> MatMultTranspose 222360 1.0 1.8736e+03421.8 4.31e+09 1.9 2.4e+09 1.3e+02 0.0e+00 0 0 4 0 0 0 0 4 0 0 3970
>>> >> [..]
>>> >> KSPSolve_FS_3    7412 1.0 2.8939e+03 1.0 2.66e+11 2.1 3.5e+10 6.1e+02 2.7e+05 17 11 67 14 28 17 11 67 14 28 148175
>>> >> MatMultAdd and MatMultTranspose (performed by GAMG) somehow ruin the scalability of the overall solver.
>>> >> The pressure space “only” has 3M unknowns so I’m guessing that’s why
>>> >> GAMG is having a hard time strong scaling.
>>> >
>>> > 3M unknowns divided by 512 processes implies less than 10k unknowns
>>> > per process. It is not unusual to see strong scaling roll off at this
>>> > size. Also note that the time per call(!) for "MatMult" is the same
>>> > for both cases, indicating that you run into a latency-limited regime.
>>> >
>>> > Also, have a look at the time ratios: With 2048 processes, MatMultAdd
>>> > and MatMultTranspose show a time ratio of 48 and 421, respectively.
>>> > Maybe one of your MPI ranks is getting a huge workload?
>>>
>>> Maybe inside GAMG itself (how could I check this?), but since the timing
>>> and ratio of the MatMult look OK and the distribution of the pressure
>>> space is the same as the other three fields, I’m guessing this does not
>>> come from my global Mat, but I may be wrong.
>>>
>>> >> For the other fields, the matrix is somehow distributed nicely, i.e.,
>>> >> I don’t want to change the overall distribution of the matrix.
>>> >> Do you have any suggestion to improve the performance of GAMG in that
>>> >> scenario? I had two ideas in mind but please correct me if I’m wrong
>>> >> or if this is not doable:
>>> >> 1) before setting up GAMG, first use a PCTELESCOPE to avoid having
>>> >> too many processes work on this small problem
>>> >> 2) have the sub_0_ and the sub_1_ work on two different
>>> >> nonoverlapping communicators of size PETSC_COMM_WORLD/2, do the solve
>>> >> concurrently, and then sum the solutions (only worth doing because of
>>> >> -pc_composite_type additive). I have no idea if this is easily doable
>>> >> with PETSc command line arguments
>>> >
>>> > 1) is the more flexible approach, as you have better control over the
>>> > system sizes after 'telescoping'.
>>>
>>> Right, but the advantage of 2) is that I wouldn't have one half or more
>>> of processes idling and I could overlap the solves of both subpc in the
>>> PCCOMPOSITE.
>>>
>>> I’m attaching the -log_view for both runs (I trimmed some options).
>>>
>>> Thanks for your help,
>>> Pierre
>>>
>>> > Best regards,
>>> > Karli
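
The figures quoted in this thread can be cross-checked with a few lines of Python. This is only a sketch: the numbers are copied from the log_view excerpts and messages above, "3M" is assumed to mean exactly 3e6 unknowns, and the variable and function names are made up for illustration.

```python
# Figures copied from the quoted messages (512 vs 2048 processes).
runs = {
    512:  {"total_time": 3.9143e+04, "matmult_time": 3.1967e+03, "matmult_calls": 1577790},
    2048: {"total_time": 1.6910e+04, "matmult_time": 3.5511e+03, "matmult_calls": 1778795},
}
UNKNOWNS = 3.0e6  # assumption: "3M" pressure unknowns taken as exactly 3e6


def strong_scaling(runs):
    """Return (speedup, parallel efficiency) going from the small to the large run."""
    (p_small, small), (p_large, large) = sorted(runs.items())
    speedup = small["total_time"] / large["total_time"]
    efficiency = speedup / (p_large / p_small)
    return speedup, efficiency


def time_per_call(run):
    """Max MatMult time divided by the number of MatMult calls."""
    return run["matmult_time"] / run["matmult_calls"]


speedup, efficiency = strong_scaling(runs)
per_call = {p: time_per_call(r) for p, r in runs.items()}
unknowns_per_process = {p: UNKNOWNS / p for p in runs}

print(f"speedup    = {speedup:.2f}x on 4x the processes")
print(f"efficiency = {efficiency:.0%}")
for p in sorted(runs):
    print(f"{p:5d} procs: {per_call[p]:.3e} s per MatMult, "
          f"{unknowns_per_process[p]:.0f} unknowns per process")
```

With these inputs the speedup comes out at roughly 2.3x for 4x the processes (about 58% efficiency), the time per MatMult call is about 2e-3 s in both runs, and the number of unknowns per process is below 10k already at 512 processes, which matches the latency-limited reading given earlier in the thread.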