Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linearsystems

Smith, Barry F. Thu, 07 Jun 2018 13:12:01 -0700

> On Jun 7, 2018, at 12:27 PM, Zhang, Junchao <[email protected]> wrote:
> 
> Searched but could not find this option, -mat_view::load_balance

   There is a space between the view and the :   load_balance is a particular 
viewer format that causes the printing of load balance information about number 
of nonzeros in the matrix.

   Barry

> 
> --Junchao Zhang
> 
> On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. <[email protected]> wrote:
>  So the only surprise in the results is the SOR. It is embarrassingly 
> parallel and normally one would not see a jump.
> 
>  The load balance for SOR time 1.5 is better at 1000 processes than for 125 
> processes of 2.1  not worse so this number doesn't easily explain it.
> 
>  Could you run the 125 and 1000 with -mat_view ::load_balance and see what 
> you get out?
> 
>    Thanks
> 
>      Barry
> 
>  Notice that the MatSOR time jumps a lot about 5 secs when the -log_sync is 
> on. My only guess is that the MatSOR is sharing memory bandwidth (or some 
> other resource? cores?) with the VecScatter and for some reason this is worse 
> for 1000 cores but I don't know why.
> 
> > On Jun 6, 2018, at 9:13 PM, Junchao Zhang <[email protected]> wrote:
> > 
> > Hi, PETSc developers,
> >  I tested Michael Becker's code. The code calls the same KSPSolve 1000 
> > times in the second stage and needs cubic number of processors to run. I 
> > ran with 125 ranks and 1000 ranks, with or without -log_sync option. I 
> > attach the log view output files and a scaling loss excel file.
> >  I profiled the code with 125 processors. It looks {MatSOR, MatMult, 
> > MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took 
> > ~50% of the time,  The other half time was spent on waiting in MPI.  
> > MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
> >  I tested it on a 36 cores/node machine. I found 32 ranks/node gave better 
> > performance (about 10%) than 36 ranks/node in the 125 ranks testing.  I 
> > guess it is because processors in the former had more balanced memory 
> > bandwidth. I collected PAPI_DP_OPS (double precision operations) and 
> > PAPI_TOT_CYC (total cycles) of the 125 ranks case (see the attached files). 
> > It looks ranks at the two ends have less DP_OPS and TOT_CYC. 
> >  Does anyone familiar with the algorithm have quick explanations?
> > 
> > --Junchao Zhang
> > 
> > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker 
> > <[email protected]> wrote:
> > Hello again,
> > 
> > this took me longer than I anticipated, but here we go.
> > I did reruns of the cases where only half the processes per node were used 
> > (without -log_sync):
> > 
> >                     125 procs,1st           125 procs,2nd          1000 
> > procs,1st          1000 procs,2nd
> >                   Max        Ratio        Max        Ratio        Max       
> >  Ratio        Max        Ratio
> > KSPSolve           1.203E+02    1.0        1.210E+02    1.0        
> > 1.399E+02    1.1        1.365E+02    1.0
> > VecTDot            6.376E+00    3.7        6.551E+00    4.0        
> > 7.885E+00    2.9        7.175E+00    3.4
> > VecNorm            4.579E+00    7.1        5.803E+00   10.2        
> > 8.534E+00    6.9        6.026E+00    4.9
> > VecScale           1.070E-01    2.1        1.129E-01    2.2        
> > 1.301E-01    2.5        1.270E-01    2.4
> > VecCopy            1.123E-01    1.3        1.149E-01    1.3        
> > 1.301E-01    1.6        1.359E-01    1.6
> > VecSet             7.063E-01    1.7        6.968E-01    1.7        
> > 7.432E-01    1.8        7.425E-01    1.8
> > VecAXPY            1.166E+00    1.4        1.167E+00    1.4        
> > 1.221E+00    1.5        1.279E+00    1.6
> > VecAYPX            1.317E+00    1.6        1.290E+00    1.6        
> > 1.536E+00    1.9        1.499E+00    2.0
> > VecScatterBegin    6.142E+00    3.2        5.974E+00    2.8        
> > 6.448E+00    3.0        6.472E+00    2.9
> > VecScatterEnd      3.606E+01    4.2        3.551E+01    4.0        
> > 5.244E+01    2.7        4.995E+01    2.7
> > MatMult            3.561E+01    1.6        3.403E+01    1.5        
> > 3.435E+01    1.4        3.332E+01    1.4
> > MatMultAdd         1.124E+01    2.0        1.130E+01    2.1        
> > 2.093E+01    2.9        1.995E+01    2.7
> > MatMultTranspose   1.372E+01    2.5        1.388E+01    2.6        
> > 1.477E+01    2.2        1.381E+01    2.1
> > MatSolve           1.949E-02    0.0        1.653E-02    0.0        
> > 4.789E-02    0.0        4.466E-02    0.0
> > MatSOR             6.610E+01    1.3        6.673E+01    1.3        
> > 7.111E+01    1.3        7.105E+01    1.3
> > MatResidual        2.647E+01    1.7        2.667E+01    1.7        
> > 2.446E+01    1.4        2.467E+01    1.5
> > PCSetUpOnBlocks    5.266E-03    1.4        5.295E-03    1.4        
> > 5.427E-03    1.5        5.289E-03    1.4
> > PCApply            1.031E+02    1.0        1.035E+02    1.0        
> > 1.180E+02    1.0        1.164E+02    1.0
> > 
> > I also slimmed down my code and basically wrote a simple weak scaling test 
> > (source files attached) so you can profile it yourself. I appreciate the 
> > offer Junchao, thank you.
> > You can adjust the system size per processor at runtime via 
> > "-nodes_per_proc 30" and the number of repeated calls to the function 
> > containing KSPsolve() via "-iterations 1000". The physical problem is 
> > simply calculating the electric potential from a homogeneous charge 
> > distribution, done multiple times to accumulate time in KSPsolve().
> > A job would be started using something like
> > mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 
> > -iterations 1000 \\
> > -ksp_rtol 1E-6 \
> > -log_view -log_sync\
> > -pc_type gamg -pc_gamg_type classical\
> > -ksp_type cg \
> > -ksp_norm_type unpreconditioned \
> > -mg_levels_ksp_type richardson \
> > -mg_levels_ksp_norm_type none \
> > -mg_levels_pc_type sor \
> > -mg_levels_ksp_max_it 1 \
> > -mg_levels_pc_sor_its 1 \
> > -mg_levels_esteig_ksp_type cg \
> > -mg_levels_esteig_ksp_max_it 10 \
> > -gamg_est_ksp_type cg
> > , ideally started on a cube number of processes for a cubical process grid.
> > Using 125 processes and 10.000 iterations I get the output in 
> > "log_view_125_new.txt", which shows the same imbalance for me.
> > Michael
> > 
> > 
> > Am 02.06.2018 um 13:40 schrieb Mark Adams:
> >> 
> >> 
> >> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <[email protected]> wrote:
> >> Hi,Michael,
> >>  You can add -log_sync besides -log_view, which adds barriers to certain 
> >> events but measures barrier time separately from the events. I find this 
> >> option makes it easier to interpret log_view output.
> >> 
> >> That is great (good to know).
> >> 
> >> This should give us a better idea if your large VecScatter costs are from 
> >> slow communication or if it catching some sort of load imbalance.
> >> 
> >> 
> >> --Junchao Zhang
> >> 
> >> On Wed, May 30, 2018 at 3:27 AM, Michael Becker 
> >> <[email protected]> wrote:
> >> Barry: On its way. Could take a couple days again.
> >> 
> >> Junchao: I unfortunately don't have access to a cluster with a faster 
> >> network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking fat-tree 
> >> network, which I realize causes parallel slowdown if the nodes are not 
> >> connected to the same switch. Each node has 24 processors (2x12/socket) 
> >> and four NUMA domains (two for each socket).
> >> The ranks are usually not distributed perfectly even, i.e. for 125 
> >> processes, of the six required nodes, five would use 21 cores and one 20.
> >> Would using another CPU type make a difference communication-wise? I could 
> >> switch to faster ones (on the same network), but I always assumed this 
> >> would only improve performance of the stuff that is unrelated to 
> >> communication.
> >> 
> >> Michael
> >> 
> >> 
> >> 
> >>> The log files have something like "Average time for zero size MPI_Send(): 
> >>> 1.84231e-05". It looks you ran on a cluster with a very slow network. A 
> >>> typical machine should give less than 1/10 of the latency you have. An 
> >>> easy way to try is just running the code on a machine with a faster 
> >>> network and see what happens.
> >>> 
> >>> Also, how many cores & numa domains does a compute node have? I could not 
> >>> figure out how you distributed the 125 MPI ranks evenly.
> >>> 
> >>> --Junchao Zhang
> >>> 
> >>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker 
> >>> <[email protected]> wrote:
> >>> Hello again,
> >>> 
> >>> here are the updated log_view files for 125 and 1000 processors. I ran 
> >>> both problems twice, the first time with all processors per node 
> >>> allocated ("-1.txt"), the second with only half on twice the number of 
> >>> nodes ("-2.txt").
> >>> 
> >>>>> On May 24, 2018, at 12:24 AM, Michael Becker 
> >>>>> <[email protected]>
> >>>>> wrote:
> >>>>> 
> >>>>> I noticed that for every individual KSP iteration, six vector objects 
> >>>>> are created and destroyed (with CG, more with e.g. GMRES). 
> >>>>> 
> >>>>   Hmm, it is certainly not intended at vectors be created and destroyed 
> >>>> within each KSPSolve() could you please point us to the code that makes 
> >>>> you think they are being created and destroyed?   We create all the work 
> >>>> vectors at KSPSetUp() and destroy them in KSPReset() not during the 
> >>>> solve. Not that this would be a measurable distance.
> >>>> 
> >>> 
> >>> I mean this, right in the log_view output:
> >>> 
> >>>> Memory usage is given in bytes:
> >>>> 
> >>>> Object Type Creations Destructions Memory Descendants' Mem.
> >>>> Reports information only for process 0.
> >>>> 
> >>>> --- Event Stage 0: Main Stage
> >>>> 
> >>>> ...
> >>>> 
> >>>> --- Event Stage 1: First Solve
> >>>> 
> >>>> ...
> >>>> 
> >>>> --- Event Stage 2: Remaining Solves
> >>>> 
> >>>> Vector 23904 23904 1295501184 0.
> >>> I logged the exact number of KSP iterations over the 999 timesteps and 
> >>> its exactly 23904/6 = 3984.
> >>> Michael
> >>> 
> >>> 
> >>> Am 24.05.2018 um 19:50 schrieb Smith, Barry F.:
> >>>> 
> >>>>  Please send the log file for 1000 with cg as the solver.
> >>>> 
> >>>>   You should make a bar chart of each event for the two cases to see 
> >>>> which ones are taking more time and which are taking less (we cannot 
> >>>> tell with the two logs you sent us since they are for different solvers.)
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>>> On May 24, 2018, at 12:24 AM, Michael Becker 
> >>>>> <[email protected]>
> >>>>> wrote:
> >>>>> 
> >>>>> I noticed that for every individual KSP iteration, six vector objects 
> >>>>> are created and destroyed (with CG, more with e.g. GMRES). 
> >>>>> 
> >>>>   Hmm, it is certainly not intended at vectors be created and destroyed 
> >>>> within each KSPSolve() could you please point us to the code that makes 
> >>>> you think they are being created and destroyed?   We create all the work 
> >>>> vectors at KSPSetUp() and destroy them in KSPReset() not during the 
> >>>> solve. Not that this would be a measurable distance.
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>>> This seems kind of wasteful, is this supposed to be like this? Is this 
> >>>>> even the reason for my problems? Apart from that, everything seems 
> >>>>> quite normal to me (but I'm not the expert here).
> >>>>> 
> >>>>> 
> >>>>> Thanks in advance.
> >>>>> 
> >>>>> Michael
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> <log_view_125procs.txt><log_vi
> >>>>> ew_1000procs.txt>
> >>>>> 
> >>> 
> >>> 
> >> 
> >> 
> >> 
> > 
> > 
> > <o-wstest-125.txt><Scaling-loss.png><o-wstest-1000.txt><o-wstest-sync-125.txt><o-wstest-sync-1000.txt><MatSOR_SeqAIJ.png><PAPI_TOT_CYC.png><PAPI_DP_OPS.png>
> 
>
Re: [petsc-dev] [petsc-users] Poor weak scaling when solving successive linearsystems

Reply via email to