Michael, I can compile and run your test. I am now profiling it. Thanks. --Junchao Zhang
On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:

> Hello again,
>
> this took me longer than I anticipated, but here we go. I did reruns of the
> cases where only half the processes per node were used (without -log_sync):
>
>                     125 procs, 1st    125 procs, 2nd    1000 procs, 1st   1000 procs, 2nd
>                     Max      Ratio    Max      Ratio    Max       Ratio   Max       Ratio
>  KSPSolve           1.203E+02  1.0    1.210E+02  1.0    1.399E+02  1.1    1.365E+02  1.0
>  VecTDot            6.376E+00  3.7    6.551E+00  4.0    7.885E+00  2.9    7.175E+00  3.4
>  VecNorm            4.579E+00  7.1    5.803E+00 10.2    8.534E+00  6.9    6.026E+00  4.9
>  VecScale           1.070E-01  2.1    1.129E-01  2.2    1.301E-01  2.5    1.270E-01  2.4
>  VecCopy            1.123E-01  1.3    1.149E-01  1.3    1.301E-01  1.6    1.359E-01  1.6
>  VecSet             7.063E-01  1.7    6.968E-01  1.7    7.432E-01  1.8    7.425E-01  1.8
>  VecAXPY            1.166E+00  1.4    1.167E+00  1.4    1.221E+00  1.5    1.279E+00  1.6
>  VecAYPX            1.317E+00  1.6    1.290E+00  1.6    1.536E+00  1.9    1.499E+00  2.0
>  VecScatterBegin    6.142E+00  3.2    5.974E+00  2.8    6.448E+00  3.0    6.472E+00  2.9
>  VecScatterEnd      3.606E+01  4.2    3.551E+01  4.0    5.244E+01  2.7    4.995E+01  2.7
>  MatMult            3.561E+01  1.6    3.403E+01  1.5    3.435E+01  1.4    3.332E+01  1.4
>  MatMultAdd         1.124E+01  2.0    1.130E+01  2.1    2.093E+01  2.9    1.995E+01  2.7
>  MatMultTranspose   1.372E+01  2.5    1.388E+01  2.6    1.477E+01  2.2    1.381E+01  2.1
>  MatSolve           1.949E-02  0.0    1.653E-02  0.0    4.789E-02  0.0    4.466E-02  0.0
>  MatSOR             6.610E+01  1.3    6.673E+01  1.3    7.111E+01  1.3    7.105E+01  1.3
>  MatResidual        2.647E+01  1.7    2.667E+01  1.7    2.446E+01  1.4    2.467E+01  1.5
>  PCSetUpOnBlocks    5.266E-03  1.4    5.295E-03  1.4    5.427E-03  1.5    5.289E-03  1.4
>  PCApply            1.031E+02  1.0    1.035E+02  1.0    1.180E+02  1.0    1.164E+02  1.0
>
> I also slimmed down my code and basically wrote a simple weak scaling test
> (source files attached) so you can profile it yourself. I appreciate the
> offer, Junchao, thank you.
> You can adjust the system size per processor at runtime via
> "-nodes_per_proc 30" and the number of repeated calls to the function
> containing KSPSolve() via "-iterations 1000". The physical problem is
> simply calculating the electric potential from a homogeneous charge
> distribution, done multiple times to accumulate time in KSPSolve().
> A job would be started using something like
>
>   mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 -iterations 1000 \
>     -ksp_rtol 1E-6 \
>     -log_view -log_sync \
>     -pc_type gamg -pc_gamg_type classical \
>     -ksp_type cg \
>     -ksp_norm_type unpreconditioned \
>     -mg_levels_ksp_type richardson \
>     -mg_levels_ksp_norm_type none \
>     -mg_levels_pc_type sor \
>     -mg_levels_ksp_max_it 1 \
>     -mg_levels_pc_sor_its 1 \
>     -mg_levels_esteig_ksp_type cg \
>     -mg_levels_esteig_ksp_max_it 10 \
>     -gamg_est_ksp_type cg
>
> ideally started on a cube number of processes for a cubical process grid.
> Using 125 processes and 10,000 iterations I get the output in
> "log_view_125_new.txt", which shows the same imbalance for me.
>
> Michael
>
>
> On 02.06.2018 at 13:40, Mark Adams wrote:
>
> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
>
>> Hi Michael,
>> You can add -log_sync besides -log_view, which adds barriers to certain
>> events but measures barrier time separately from the events. I find this
>> option makes it easier to interpret log_view output.
>
> That is great (good to know).
>
> This should give us a better idea if your large VecScatter costs are from
> slow communication or if it is catching some sort of load imbalance.
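A minimal driver along the lines Michael describes (reading -nodes_per_proc and -iterations at runtime and accumulating time in repeated KSPSolve() calls, with the first solve in its own log stage as in the logs above) could look like the sketch below. This is not the attached source: the assembled operator is only a 1D Laplacian placeholder where the real test builds a 3D Poisson problem on a cubical process grid, and error checking (CHKERRQ) is omitted for brevity.

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat           A;
  Vec           x, b;
  KSP           ksp;
  PetscInt      nodes_per_proc = 30, iterations = 1000, n, rstart, rend, i;
  PetscLogStage stage_first, stage_rest;

  PetscInitialize(&argc, &argv, NULL, NULL);
  PetscOptionsGetInt(NULL, NULL, "-nodes_per_proc", &nodes_per_proc, NULL);
  PetscOptionsGetInt(NULL, NULL, "-iterations", &iterations, NULL);

  /* Placeholder operator: a 1D Laplacian with nodes_per_proc rows per rank.
     The real test assembles a 3D Poisson problem instead. */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, nodes_per_proc, nodes_per_proc, PETSC_DECIDE, PETSC_DECIDE);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &rstart, &rend);
  MatGetSize(A, &n, NULL);
  for (i = rstart; i < rend; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);  /* -ksp_type cg -pc_type gamg ... taken from the command line */

  /* Separate the first solve (which includes the GAMG setup) from the rest,
     mirroring the "First Solve" / "Remaining Solves" stages in the logs. */
  PetscLogStageRegister("First Solve", &stage_first);
  PetscLogStageRegister("Remaining Solves", &stage_rest);

  PetscLogStagePush(stage_first);
  KSPSolve(ksp, b, x);
  PetscLogStagePop();

  PetscLogStagePush(stage_rest);
  for (i = 1; i < iterations; i++) KSPSolve(ksp, b, x);
  PetscLogStagePop();

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}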
>
>> --Junchao Zhang
>>
>> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:
>>
>>> Barry: On its way. Could take a couple days again.
>>>
>>> Junchao: I unfortunately don't have access to a cluster with a faster
>>> network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking fat-tree
>>> network, which I realize causes parallel slowdown if the nodes are not
>>> connected to the same switch. Each node has 24 processors (2x12/socket)
>>> and four NUMA domains (two for each socket).
>>> The ranks are usually not distributed perfectly evenly, i.e. for 125
>>> processes, five of the six required nodes would use 21 cores and one
>>> would use 20.
>>> Would using another CPU type make a difference communication-wise? I
>>> could switch to faster ones (on the same network), but I always assumed
>>> this would only improve performance of the stuff that is unrelated to
>>> communication.
>>>
>>> Michael
>>>
>>>
>>> The log files have something like "Average time for zero size
>>> MPI_Send(): 1.84231e-05". It looks like you ran on a cluster with a very
>>> slow network. A typical machine should give less than 1/10 of the latency
>>> you have. An easy way to check is to run the code on a machine with a
>>> faster network and see what happens.
>>>
>>> Also, how many cores & NUMA domains does a compute node have? I could
>>> not figure out how you distributed the 125 MPI ranks evenly.
>>>
>>> --Junchao Zhang
>>>
>>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:
>>>
>>>> Hello again,
>>>>
>>>> here are the updated log_view files for 125 and 1000 processors. I ran
>>>> both problems twice, the first time with all processors per node
>>>> allocated ("-1.txt"), the second with only half on twice the number of
>>>> nodes ("-2.txt").
>>>>
>>>> On May 24, 2018, at 12:24 AM, Michael Becker
>>>> <michael.bec...@physik.uni-giessen.de> wrote:
>>>>
>>>> I noticed that for every individual KSP iteration, six vector objects
>>>> are created and destroyed (with CG, more with e.g. GMRES).
>>>>
>>>> Hmm, it is certainly not intended that vectors be created and destroyed
>>>> within each KSPSolve(); could you please point us to the code that makes
>>>> you think they are being created and destroyed? We create all the work
>>>> vectors at KSPSetUp() and destroy them in KSPReset(), not during the
>>>> solve. Not that this would be a measurable difference.
>>>>
>>>> I mean this, right in the log_view output:
>>>>
>>>> Memory usage is given in bytes:
>>>>
>>>> Object Type    Creations    Destructions    Memory    Descendants' Mem.
>>>> Reports information only for process 0.
>>>>
>>>> --- Event Stage 0: Main Stage
>>>> ...
>>>> --- Event Stage 1: First Solve
>>>> ...
>>>> --- Event Stage 2: Remaining Solves
>>>>
>>>> Vector    23904    23904    1295501184    0.
>>>>
>>>> I logged the exact number of KSP iterations over the 999 timesteps and
>>>> it's exactly 23904/6 = 3984.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On 24.05.2018 at 19:50, Smith, Barry F. wrote:
>>>>
>>>> Please send the log file for 1000 with cg as the solver.
>>>>
>>>> You should make a bar chart of each event for the two cases to see
>>>> which ones are taking more time and which are taking less (we cannot
>>>> tell with the two logs you sent us since they are for different solvers).
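As a cross-check of the arithmetic above (23904 Vector creations divided by 6 work vectors per CG iteration gives 3984 iterations), the total Krylov iteration count can be accumulated with KSPGetIterationNumber(). A sketch, reusing the ksp, b, x, i and iterations variables from the driver sketch earlier in the thread; the remaining-solves loop would become:

PetscInt its, total_its = 0;
for (i = 1; i < iterations; i++) {
  KSPSolve(ksp, b, x);
  KSPGetIterationNumber(ksp, &its);  /* iteration count of the most recent solve */
  total_its += its;
}
PetscPrintf(PETSC_COMM_WORLD, "Total KSP iterations over %D solves: %D\n", iterations - 1, total_its);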
>>>>
>>>> On May 24, 2018, at 12:24 AM, Michael Becker
>>>> <michael.bec...@physik.uni-giessen.de> wrote:
>>>>
>>>> I noticed that for every individual KSP iteration, six vector objects
>>>> are created and destroyed (with CG, more with e.g. GMRES).
>>>>
>>>> Hmm, it is certainly not intended that vectors be created and destroyed
>>>> within each KSPSolve(); could you please point us to the code that makes
>>>> you think they are being created and destroyed? We create all the work
>>>> vectors at KSPSetUp() and destroy them in KSPReset(), not during the
>>>> solve. Not that this would be a measurable difference.
>>>>
>>>> This seems kind of wasteful; is it supposed to be like this? Is this
>>>> even the reason for my problems? Apart from that, everything seems quite
>>>> normal to me (but I'm not the expert here).
>>>>
>>>> Thanks in advance.
>>>>
>>>> Michael
>>>>
>>>> <log_view_125procs.txt><log_view_1000procs.txt>
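On the network-latency point raised earlier in the thread ("Average time for zero size MPI_Send(): 1.84231e-05"), a standalone ping-pong between two ranks gives a quick, PETSc-independent number to compare against. A rough sketch, not the exact benchmark PETSc uses for that statistic:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int    rank, i, reps = 10000;
  double t0, t1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (i = 0; i < reps; i++) {
    if (rank == 0) {                    /* rank 0 <-> rank 1 round trips */
      MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }
  }
  t1 = MPI_Wtime();

  if (rank == 0)
    printf("Average one-way zero-size latency: %g s\n", (t1 - t0) / (2.0 * reps));

  MPI_Finalize();
  return 0;
}

Run it with the two ranks placed on different nodes (and, for comparison, on nodes attached to different switches of the blocking fat-tree) so the measured latency reflects the InfiniBand path rather than shared memory.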