On Jun 2, 2014, at 9:13 AM, Qin Lu <[email protected]> wrote:

> Will the speedup measured by the streams benchmark be the upper limit of the
> speedup of a parallel program? I.e., suppose there is a program with ideal
> linear speedup (= 2 for np=2 if running on a perfect machine for parallelism);
> if it runs on your laptop, would the maximum speedup be 1.44 with np=2?
   It depends on the computation being run. For PETSc solvers it is generally a
pretty good measure of the upper bound.

> Thanks,
> Qin
>
> From: Barry Smith <[email protected]>
> To: Qin Lu <[email protected]>
> Cc: petsc-users <[email protected]>
> Sent: Thursday, May 29, 2014 5:46 PM
> Subject: Re: [petsc-users] About parallel performance
>
> For the parallel case a perfect machine would have twice the memory bandwidth
> when using 2 cores as opposed to 1 core. For yours it is almost exactly the
> same. The issue is not with MPI or the software; it depends on how many
> memory sockets there are and how they are shared by the various cores. As I
> said, the initial memory bandwidth for one core, 21,682 megabytes per second,
> is good, so it is a very good sequential machine.
>
> Here are the results on my laptop:
>
> Number of MPI processes 1
> Process 0 Barrys-MacBook-Pro.local
> Function    Rate (MB/s)
> Copy:        7928.7346
> Scale:       8271.5103
> Add:        11017.0430
> Triad:      10843.9018
> Number of MPI processes 2
> Process 0 Barrys-MacBook-Pro.local
> Process 1 Barrys-MacBook-Pro.local
> Function    Rate (MB/s)
> Copy:       13513.0365
> Scale:      13516.7086
> Add:        15455.3952
> Triad:      15562.0822
> ------------------------------------------------
> np  speedup
> 1   1.0
> 2   1.44
>
> Note that the memory bandwidth is much lower than your machine's, but there
> is an increase in speedup from one to two cores because one core cannot
> utilize all the memory bandwidth. But even with two cores my laptop will be
> slower on PETSc than one core on your machine.
> Here is the performance on a workstation we have that has multiple CPUs and
> multiple memory sockets:
>
> Number of MPI processes 1
> Process 0 es
> Function    Rate (MB/s)
> Copy:       13077.8260
> Scale:      12867.1966
> Add:        14637.6757
> Triad:      14414.4478
> Number of MPI processes 2
> Process 0 es
> Process 1 es
> Function    Rate (MB/s)
> Copy:       22663.3116
> Scale:      22102.5495
> Add:        25768.1550
> Triad:      26076.0410
> Number of MPI processes 3
> Process 0 es
> Process 1 es
> Process 2 es
> Function    Rate (MB/s)
> Copy:       27501.7610
> Scale:      26971.2183
> Add:        30433.3276
> Triad:      31302.9396
> Number of MPI processes 4
> Process 0 es
> Process 1 es
> Process 2 es
> Process 3 es
> Function    Rate (MB/s)
> Copy:       29302.3183
> Scale:      30165.5295
> Add:        34577.3458
> Triad:      35195.8067
> ------------------------------------------------
> np  speedup
> 1   1.0
> 2   1.81
> 3   2.17
> 4   2.44
>
> Note that one core has a lower memory bandwidth than your machine, but as I
> add more cores the memory bandwidth increases by a factor of 2.4.
>
> There is nothing wrong with your machine; it is just not suitable for running
> sparse linear algebra on multiple cores.
>
> Barry
>
> On May 29, 2014, at 5:15 PM, Qin Lu <[email protected]> wrote:
>
> > Barry,
> >
> > How did you read the test results? For a machine good for parallelism,
> > should the data for np=2 be about half of those for np=1?
> >
> > The machine has very new Intel chips and is very good for serial runs. What
> > may cause the bad parallelism - the configuration of the machine, or an MPI
> > library (MPICH2) that was not built correctly?
> >
> > Many thanks,
> > Qin
> >
> > ----- Original Message -----
> > From: Barry Smith <[email protected]>
> > To: Qin Lu <[email protected]>; petsc-users <[email protected]>
> > Cc:
> > Sent: Thursday, May 29, 2014 4:54 PM
> > Subject: Re: [petsc-users] About parallel performance
> >
> > In that PETSc version BasicVersion is actually the MPI streams benchmark,
> > so you ran the right thing.
Your machine is totally worthless for sparse linear algebra parallelism: the
> > entire memory bandwidth is used by the first core, so adding the second
> > core to the computation gives you no improvement at all in the streams
> > benchmark.
> >
> > But the single-core memory bandwidth is pretty good, so for problems that
> > don't need parallelism you should get good performance.
> >
> > Barry
> >
> > On May 29, 2014, at 4:37 PM, Qin Lu <[email protected]> wrote:
> >
> >> Barry,
> >>
> >> I have PETSc-3.4.2 and I didn't see MPIVersion there; do you mean
> >> BasicVersion? I built and ran it (if you did mean MPIVersion, I will get
> >> PETSc-3.4 later):
> >>
> >> =================
> >> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 1 ./BasicVersion
> >> Number of MPI processes 1
> >> Function    Rate (MB/s)
> >> Copy:       21682.9932
> >> Scale:      21637.5509
> >> Add:        21583.0395
> >> Triad:      21504.6563
> >> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 2 ./BasicVersion
> >> Number of MPI processes 2
> >> Function    Rate (MB/s)
> >> Copy:       21369.6976
> >> Scale:      21632.3203
> >> Add:        22203.7107
> >> Triad:      22305.1841
> >> =======================
> >>
> >> Thanks a lot,
> >> Qin
> >>
> >> From: Barry Smith <[email protected]>
> >> To: Qin Lu <[email protected]>
> >> Cc: "[email protected]" <[email protected]>
> >> Sent: Thursday, May 29, 2014 4:17 PM
> >> Subject: Re: [petsc-users] About parallel performance
> >>
> >> You need to run the streams benchmark at one and two processes to see how
> >> the memory bandwidth changes. If you are using petsc-3.4 you can
> >>
> >> cd src/benchmarks/streams/
> >> make MPIVersion
> >> mpiexec -n 1 ./MPIVersion
> >> mpiexec -n 2 ./MPIVersion
> >>
> >> and send all the results.
> >>
> >> Barry
> >>
> >> On May 29, 2014, at 4:06 PM, Qin Lu <[email protected]> wrote:
> >>
> >>> For now I only care about the CPU time of the PETSc subroutines.
I tried to add
> >>> PetscLogEventBegin/End and the results are consistent with the
> >>> log_summary attached in my first email.
> >>>
> >>> The CPU times of MatSetValues and MatAssemblyBegin/End in both the p1
> >>> and p2 runs are small (< 20 sec). The CPU times of PCSetUp/PCApply are
> >>> about the same between p1 and p2 (~120 sec). The CPU time of KSPSolve
> >>> for p2 (143 sec) is a little lower than p1's (176 sec), but p2 spent
> >>> more time in MatGetSubMatrices (43 sec). So the total CPU times of the
> >>> PETSc subroutines are about the same between p1 and p2 (502 sec vs.
> >>> 488 sec).
> >>>
> >>> It seems I need a more efficient parallel preconditioner. Do you have
> >>> any suggestions for that?
> >>>
> >>> Many thanks,
> >>> Qin
> >>>
> >>> ----- Original Message -----
> >>> From: Barry Smith <[email protected]>
> >>> To: Qin Lu <[email protected]>
> >>> Cc: "[email protected]" <[email protected]>
> >>> Sent: Thursday, May 29, 2014 2:12 PM
> >>> Subject: Re: [petsc-users] About parallel performance
> >>>
> >>> You need to determine where the other 80% of the time is. My guess is
> >>> that it is in setting the values into the matrix each time. Use
> >>> PetscLogEventRegister() and put a PetscLogEventBegin/End() around the
> >>> code that computes all the entries in the matrix and calls
> >>> MatSetValues() and MatAssemblyBegin/End().
> >>>
> >>> Likely the reason the linear solver does not scale better is that you
> >>> have a machine with multiple cores that share the same memory
> >>> bandwidth, and the first core is already using well over half the
> >>> memory bandwidth, so the second core cannot be fully utilized since
> >>> both cores have to wait for data to arrive from memory. If you are
> >>> using the development version of PETSc you can run make streams
> >>> NPMAX=2 from the PETSc root directory and send this to us to confirm
> >>> this.
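Barry's logging suggestion can be sketched as follows. This is only a minimal illustration of the PetscLogEventRegister/Begin/End pattern; the event name "MatFill", the function name, and the elided assembly loop are placeholders, not code from Qin's program, and it requires a PETSc build to compile:

```c
#include <petscmat.h>  /* PETSc header; needs a PETSc installation */

/* Sketch: time the whole matrix-fill phase (entry computation,
   MatSetValues(), and assembly) under one custom log event so it
   shows up as its own line in -log_summary. */
PetscErrorCode FillMatrix(Mat A)
{
  PetscLogEvent  MAT_Fill;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogEventRegister("MatFill", MAT_CLASSID, &MAT_Fill);CHKERRQ(ierr);
  ierr = PetscLogEventBegin(MAT_Fill, 0, 0, 0, 0);CHKERRQ(ierr);

  /* ... compute entries and call MatSetValues() here ... */

  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = PetscLogEventEnd(MAT_Fill, 0, 0, 0, 0);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```

With this in place, running with -log_summary reports the "MatFill" event's time and flop counts separately, which is how one confirms where the other 80% of the time goes.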
> >>> Barry
> >>>
> >>> On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I implemented the PETSc parallel linear solver in a program; the
> >>>> implementation is basically the same as
> >>>> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the
> >>>> MatMPIAIJ and let PETSc partition the matrix through
> >>>> MatGetOwnershipRange. However, a few tests show the parallel solver
> >>>> is always a little slower than the serial solver (I have excluded
> >>>> the matrix-generation CPU time).
> >>>>
> >>>> For the serial run I used PCILU as the preconditioner; for the
> >>>> parallel run, I used ASM with ILU(0) on each subblock (-sub_pc_type
> >>>> ilu -sub_ksp_type preonly -ksp_type bcgs -pc_type asm). The number
> >>>> of unknowns is around 200,000.
> >>>>
> >>>> I have used -log_summary to print out the performance summary as
> >>>> attached (log_summary_p1 for the serial run and log_summary_p2 for
> >>>> the run with 2 processes). It seems KSPSolve accounts for less than
> >>>> 20% of Global %T.
> >>>>
> >>>> My questions are:
> >>>>
> >>>> 1. What is the bottleneck of the parallel run according to the
> >>>>    summary?
> >>>> 2. Do you have any suggestions to improve the parallel performance?
> >>>>
> >>>> Thanks a lot for your suggestions!
> >>>>
> >>>> Regards,
> >>>> Qin
> >>>> <log_summary_p1.txt><log_summary_p2.txt>
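The serial ILU vs. parallel ASM+ILU(0) setup Qin describes needs no code changes when the solver is configured through the options database, as in ex2.c. A hedged sketch of the relevant calls (PETSc 3.4-era API; the function name and argument names are illustrative, not Qin's actual code):

```c
#include <petscksp.h>  /* PETSc header; needs a PETSc installation */

/* Sketch: solve Ax = b with the solver chosen at run time, so the same
   code runs serial (-pc_type ilu) or parallel
   (-ksp_type bcgs -pc_type asm -sub_pc_type ilu -sub_ksp_type preonly). */
PetscErrorCode Solve(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  /* PETSc 3.4 signature; the MatStructure argument was dropped in 3.5 */
  ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* reads -ksp_type, -pc_type, ... */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```

Because everything goes through KSPSetFromOptions(), trying a different parallel preconditioner (e.g. -pc_type hypre or -pc_type gamg, if those packages are available) is a command-line change rather than a code change.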
