Michael, I can compile and run your test. I am now profiling it. Thanks. --Junchao Zhang
On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:

> Hello again,
>
> this took me longer than I anticipated, but here we go. I did reruns of the
> cases where only half the processes per node were used (without -log_sync):
>
>                     125 procs, 1st    125 procs, 2nd    1000 procs, 1st   1000 procs, 2nd
>                     Max      Ratio    Max      Ratio    Max       Ratio   Max       Ratio
>  KSPSolve           1.203E+02  1.0    1.210E+02  1.0    1.399E+02  1.1    1.365E+02  1.0
>  VecTDot            6.376E+00  3.7    6.551E+00  4.0    7.885E+00  2.9    7.175E+00  3.4
>  VecNorm            4.579E+00  7.1    5.803E+00 10.2    8.534E+00  6.9    6.026E+00  4.9
>  VecScale           1.070E-01  2.1    1.129E-01  2.2    1.301E-01  2.5    1.270E-01  2.4
>  VecCopy            1.123E-01  1.3    1.149E-01  1.3    1.301E-01  1.6    1.359E-01  1.6
>  VecSet             7.063E-01  1.7    6.968E-01  1.7    7.432E-01  1.8    7.425E-01  1.8
>  VecAXPY            1.166E+00  1.4    1.167E+00  1.4    1.221E+00  1.5    1.279E+00  1.6
>  VecAYPX            1.317E+00  1.6    1.290E+00  1.6    1.536E+00  1.9    1.499E+00  2.0
>  VecScatterBegin    6.142E+00  3.2    5.974E+00  2.8    6.448E+00  3.0    6.472E+00  2.9
>  VecScatterEnd      3.606E+01  4.2    3.551E+01  4.0    5.244E+01  2.7    4.995E+01  2.7
>  MatMult            3.561E+01  1.6    3.403E+01  1.5    3.435E+01  1.4    3.332E+01  1.4
>  MatMultAdd         1.124E+01  2.0    1.130E+01  2.1    2.093E+01  2.9    1.995E+01  2.7
>  MatMultTranspose   1.372E+01  2.5    1.388E+01  2.6    1.477E+01  2.2    1.381E+01  2.1
>  MatSolve           1.949E-02  0.0    1.653E-02  0.0    4.789E-02  0.0    4.466E-02  0.0
>  MatSOR             6.610E+01  1.3    6.673E+01  1.3    7.111E+01  1.3    7.105E+01  1.3
>  MatResidual        2.647E+01  1.7    2.667E+01  1.7    2.446E+01  1.4    2.467E+01  1.5
>  PCSetUpOnBlocks    5.266E-03  1.4    5.295E-03  1.4    5.427E-03  1.5    5.289E-03  1.4
>  PCApply            1.031E+02  1.0    1.035E+02  1.0    1.180E+02  1.0    1.164E+02  1.0
>
> I also slimmed down my code and basically wrote a simple weak scaling test
> (source files attached) so you can profile it yourself. I appreciate the
> offer, Junchao, thank you.
> You can adjust the system size per processor at runtime via
> "-nodes_per_proc 30" and the number of repeated calls to the function
> containing KSPSolve() via "-iterations 1000". The physical problem is
> simply calculating the electric potential from a homogeneous charge
> distribution, done multiple times to accumulate time in KSPSolve().
> A job would be started using something like
>
>   mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 -iterations 1000 \
>     -ksp_rtol 1E-6 \
>     -log_view -log_sync \
>     -pc_type gamg -pc_gamg_type classical \
>     -ksp_type cg \
>     -ksp_norm_type unpreconditioned \
>     -mg_levels_ksp_type richardson \
>     -mg_levels_ksp_norm_type none \
>     -mg_levels_pc_type sor \
>     -mg_levels_ksp_max_it 1 \
>     -mg_levels_pc_sor_its 1 \
>     -mg_levels_esteig_ksp_type cg \
>     -mg_levels_esteig_ksp_max_it 10 \
>     -gamg_est_ksp_type cg
>
> ideally started on a cube number of processes for a cubical process grid.
> Using 125 processes and 10,000 iterations I get the output in
> "log_view_125_new.txt", which shows the same imbalance for me.
>
> Michael
>
>
> On 02.06.2018 at 13:40, Mark Adams wrote:
>
> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <jczh...@mcs.anl.gov> wrote:
>
>> Hi Michael,
>> You can add -log_sync besides -log_view, which adds barriers to certain
>> events but measures barrier time separately from the events. I find this
>> option makes it easier to interpret log_view output.
>
> That is great (good to know).
>
> This should give us a better idea if your large VecScatter costs are from
> slow communication or if it is catching some sort of load imbalance.
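A minimal driver along the lines Michael describes (reading -nodes_per_proc and -iterations at runtime and accumulating time in repeated KSPSolve() calls, with the first solve in its own log stage as in the logs above) could look like the sketch below. This is not the attached source: the assembled operator is only a 1D Laplacian placeholder where the real test builds a 3D Poisson problem on a cubical process grid, and error checking (CHKERRQ) is omitted for brevity.

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat           A;
  Vec           x, b;
  KSP           ksp;
  PetscInt      nodes_per_proc = 30, iterations = 1000, n, rstart, rend, i;
  PetscLogStage stage_first, stage_rest;

  PetscInitialize(&argc, &argv, NULL, NULL);
  PetscOptionsGetInt(NULL, NULL, "-nodes_per_proc", &nodes_per_proc, NULL);
  PetscOptionsGetInt(NULL, NULL, "-iterations", &iterations, NULL);

  /* Placeholder operator: a 1D Laplacian with nodes_per_proc rows per rank.
     The real test assembles a 3D Poisson problem instead. */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, nodes_per_proc, nodes_per_proc, PETSC_DECIDE, PETSC_DECIDE);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &rstart, &rend);
  MatGetSize(A, &n, NULL);
  for (i = rstart; i < rend; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);  /* -ksp_type cg -pc_type gamg ... taken from the command line */

  /* Separate the first solve (which includes the GAMG setup) from the rest,
     mirroring the "First Solve" / "Remaining Solves" stages in the logs. */
  PetscLogStageRegister("First Solve", &stage_first);
  PetscLogStageRegister("Remaining Solves", &stage_rest);

  PetscLogStagePush(stage_first);
  KSPSolve(ksp, b, x);
  PetscLogStagePop();

  PetscLogStagePush(stage_rest);
  for (i = 1; i < iterations; i++) KSPSolve(ksp, b, x);
  PetscLogStagePop();

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}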
>
>> --Junchao Zhang
>>
>> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:
>>
>>> Barry: On its way. Could take a couple days again.
>>>
>>> Junchao: I unfortunately don't have access to a cluster with a faster
>>> network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking fat-tree
>>> network, which I realize causes parallel slowdown if the nodes are not
>>> connected to the same switch. Each node has 24 processors (2x12/socket)
>>> and four NUMA domains (two for each socket).
>>> The ranks are usually not distributed perfectly evenly, i.e. for 125
>>> processes, five of the six required nodes would use 21 cores and one
>>> would use 20.
>>> Would using another CPU type make a difference communication-wise? I
>>> could switch to faster ones (on the same network), but I always assumed
>>> this would only improve performance of the stuff that is unrelated to
>>> communication.
>>>
>>> Michael
>>>
>>>
>>> The log files have something like "Average time for zero size
>>> MPI_Send(): 1.84231e-05". It looks like you ran on a cluster with a very
>>> slow network. A typical machine should give less than 1/10 of the latency
>>> you have. An easy way to check is to run the code on a machine with a
>>> faster network and see what happens.
>>>
>>> Also, how many cores & NUMA domains does a compute node have? I could
>>> not figure out how you distributed the 125 MPI ranks evenly.
>>>
>>> --Junchao Zhang
>>>
>>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <michael.bec...@physik.uni-giessen.de> wrote:
>>>
>>>> Hello again,
>>>>
>>>> here are the updated log_view files for 125 and 1000 processors. I ran
>>>> both problems twice, the first time with all processors per node
>>>> allocated ("-1.txt"), the second with only half on twice the number of
>>>> nodes ("-2.txt").
>>>>
>>>> On May 24, 2018, at 12:24 AM, Michael Becker
>>>> <michael.bec...@physik.uni-giessen.de> wrote:
>>>>
>>>> I noticed that for every individual KSP iteration, six vector objects
>>>> are created and destroyed (with CG, more with e.g. GMRES).
>>>>
>>>> Hmm, it is certainly not intended that vectors be created and destroyed
>>>> within each KSPSolve(); could you please point us to the code that makes
>>>> you think they are being created and destroyed? We create all the work
>>>> vectors at KSPSetUp() and destroy them in KSPReset(), not during the
>>>> solve. Not that this would be a measurable difference.
>>>>
>>>> I mean this, right in the log_view output:
>>>>
>>>> Memory usage is given in bytes:
>>>>
>>>> Object Type    Creations    Destructions    Memory    Descendants' Mem.
>>>> Reports information only for process 0.
>>>>
>>>> --- Event Stage 0: Main Stage
>>>> ...
>>>> --- Event Stage 1: First Solve
>>>> ...
>>>> --- Event Stage 2: Remaining Solves
>>>>
>>>> Vector    23904    23904    1295501184    0.
>>>>
>>>> I logged the exact number of KSP iterations over the 999 timesteps and
>>>> it's exactly 23904/6 = 3984.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On 24.05.2018 at 19:50, Smith, Barry F. wrote:
>>>>
>>>> Please send the log file for 1000 with cg as the solver.
>>>>
>>>> You should make a bar chart of each event for the two cases to see
>>>> which ones are taking more time and which are taking less (we cannot
>>>> tell with the two logs you sent us since they are for different solvers).
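As a cross-check of the arithmetic above (23904 Vector creations divided by 6 work vectors per CG iteration gives 3984 iterations), the total Krylov iteration count can be accumulated with KSPGetIterationNumber(). A sketch, reusing the ksp, b, x, i and iterations variables from the driver sketch earlier in the thread; the remaining-solves loop would become:

PetscInt its, total_its = 0;
for (i = 1; i < iterations; i++) {
  KSPSolve(ksp, b, x);
  KSPGetIterationNumber(ksp, &its);  /* iteration count of the most recent solve */
  total_its += its;
}
PetscPrintf(PETSC_COMM_WORLD, "Total KSP iterations over %D solves: %D\n", iterations - 1, total_its);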
>>>>
>>>> On May 24, 2018, at 12:24 AM, Michael Becker
>>>> <michael.bec...@physik.uni-giessen.de> wrote:
>>>>
>>>> I noticed that for every individual KSP iteration, six vector objects
>>>> are created and destroyed (with CG, more with e.g. GMRES).
>>>>
>>>> Hmm, it is certainly not intended that vectors be created and destroyed
>>>> within each KSPSolve(); could you please point us to the code that makes
>>>> you think they are being created and destroyed? We create all the work
>>>> vectors at KSPSetUp() and destroy them in KSPReset(), not during the
>>>> solve. Not that this would be a measurable difference.
>>>>
>>>> This seems kind of wasteful; is it supposed to be like this? Is this
>>>> even the reason for my problems? Apart from that, everything seems quite
>>>> normal to me (but I'm not the expert here).
>>>>
>>>> Thanks in advance.
>>>>
>>>> Michael
>>>>
>>>> <log_view_125procs.txt><log_view_1000procs.txt>
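On the network-latency point raised earlier in the thread ("Average time for zero size MPI_Send(): 1.84231e-05"), a standalone ping-pong between two ranks gives a quick, PETSc-independent number to compare against. A rough sketch, not the exact benchmark PETSc uses for that statistic:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int    rank, i, reps = 10000;
  double t0, t1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (i = 0; i < reps; i++) {
    if (rank == 0) {                    /* rank 0 <-> rank 1 round trips */
      MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }
  }
  t1 = MPI_Wtime();

  if (rank == 0)
    printf("Average one-way zero-size latency: %g s\n", (t1 - t0) / (2.0 * reps));

  MPI_Finalize();
  return 0;
}

Run it with the two ranks placed on different nodes (and, for comparison, on nodes attached to different switches of the blocking fat-tree) so the measured latency reflects the InfiniBand path rather than shared memory.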