On Jun 2, 2014, at 9:13 AM, Qin Lu <[email protected]> wrote:

> Will the speedup measured by the streams benchmark be the upper limit of the
> speedup of a parallel program? I.e., suppose there is a program with ideal
> linear speedup (= 2 for np=2 if running on a perfect machine for parallelism);
> if it runs on your laptop, would the maximum speedup be 1.44 with np=2?
   It depends on the computation being run. For PETSc solvers it is generally a
pretty good measure of the upper bound.

> Thanks,
> Qin
>
> From: Barry Smith <[email protected]>
> To: Qin Lu <[email protected]>
> Cc: petsc-users <[email protected]>
> Sent: Thursday, May 29, 2014 5:46 PM
> Subject: Re: [petsc-users] About parallel performance
>
> For the parallel case a perfect machine would have twice the memory bandwidth
> when using 2 cores as opposed to 1 core. For yours it is almost exactly the
> same. The issue is not with MPI or the software; it depends on how many
> memory sockets there are and how they are shared by the various cores. As I
> said, the initial memory bandwidth for one core, 21,682 megabytes per second,
> is good, so it is a very good sequential machine.
>
> Here are the results on my laptop:
>
> Number of MPI processes 1
> Process 0 Barrys-MacBook-Pro.local
> Function    Rate (MB/s)
> Copy:        7928.7346
> Scale:       8271.5103
> Add:        11017.0430
> Triad:      10843.9018
> Number of MPI processes 2
> Process 0 Barrys-MacBook-Pro.local
> Process 1 Barrys-MacBook-Pro.local
> Function    Rate (MB/s)
> Copy:       13513.0365
> Scale:      13516.7086
> Add:        15455.3952
> Triad:      15562.0822
> ------------------------------------------------
> np  speedup
> 1   1.0
> 2   1.44
>
> Note that the memory bandwidth is much lower than your machine's, but there
> is an increase in speedup from one to two cores because one core cannot
> utilize all the memory bandwidth. But even with two cores my laptop will be
> slower on PETSc than one core on your machine.
> Here is the performance on a workstation we have that has multiple CPUs and
> multiple memory sockets:
>
> Number of MPI processes 1
> Process 0 es
> Function    Rate (MB/s)
> Copy:       13077.8260
> Scale:      12867.1966
> Add:        14637.6757
> Triad:      14414.4478
> Number of MPI processes 2
> Process 0 es
> Process 1 es
> Function    Rate (MB/s)
> Copy:       22663.3116
> Scale:      22102.5495
> Add:        25768.1550
> Triad:      26076.0410
> Number of MPI processes 3
> Process 0 es
> Process 1 es
> Process 2 es
> Function    Rate (MB/s)
> Copy:       27501.7610
> Scale:      26971.2183
> Add:        30433.3276
> Triad:      31302.9396
> Number of MPI processes 4
> Process 0 es
> Process 1 es
> Process 2 es
> Process 3 es
> Function    Rate (MB/s)
> Copy:       29302.3183
> Scale:      30165.5295
> Add:        34577.3458
> Triad:      35195.8067
> ------------------------------------------------
> np  speedup
> 1   1.0
> 2   1.81
> 3   2.17
> 4   2.44
>
> Note that one core has a lower memory bandwidth than your machine, but as I
> add more cores the memory bandwidth increases by a factor of 2.4.
>
> There is nothing wrong with your machine; it is just not suitable for running
> sparse linear algebra on multiple cores.
>
> Barry
>
> On May 29, 2014, at 5:15 PM, Qin Lu <[email protected]> wrote:
>
> > Barry,
> >
> > How did you read the test results? For a machine good for parallelism,
> > should the data for np=2 be about half of those for np=1?
> >
> > The machine has very new Intel chips and is very good for serial runs. What
> > may cause the bad parallelism - the configuration of the machine, or an MPI
> > library (MPICH2) that was not built correctly?
> >
> > Many thanks,
> > Qin
> >
> > ----- Original Message -----
> > From: Barry Smith <[email protected]>
> > To: Qin Lu <[email protected]>; petsc-users <[email protected]>
> > Cc:
> > Sent: Thursday, May 29, 2014 4:54 PM
> > Subject: Re: [petsc-users] About parallel performance
> >
> > In that PETSc version BasicVersion is actually the MPI streams benchmark,
> > so you ran the right thing.
Your machine is totally worthless for sparse linear algebra parallelism: the
> > entire memory bandwidth is used by the first core, so adding the second
> > core to the computation gives you no improvement at all in the streams
> > benchmark.
> >
> > But the single-core memory bandwidth is pretty good, so for problems that
> > don't need parallelism you should get good performance.
> >
> > Barry
> >
> > On May 29, 2014, at 4:37 PM, Qin Lu <[email protected]> wrote:
> >
> >> Barry,
> >>
> >> I have PETSc-3.4.2 and I didn't see MPIVersion there; do you mean
> >> BasicVersion? I built and ran it (if you did mean MPIVersion, I will get
> >> PETSc-3.4 later):
> >>
> >> =================
> >> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 1 ./BasicVersion
> >> Number of MPI processes 1
> >> Function    Rate (MB/s)
> >> Copy:       21682.9932
> >> Scale:      21637.5509
> >> Add:        21583.0395
> >> Triad:      21504.6563
> >> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 2 ./BasicVersion
> >> Number of MPI processes 2
> >> Function    Rate (MB/s)
> >> Copy:       21369.6976
> >> Scale:      21632.3203
> >> Add:        22203.7107
> >> Triad:      22305.1841
> >> =======================
> >>
> >> Thanks a lot,
> >> Qin
> >>
> >> From: Barry Smith <[email protected]>
> >> To: Qin Lu <[email protected]>
> >> Cc: "[email protected]" <[email protected]>
> >> Sent: Thursday, May 29, 2014 4:17 PM
> >> Subject: Re: [petsc-users] About parallel performance
> >>
> >> You need to run the streams benchmark at one and two processes to see how
> >> the memory bandwidth changes. If you are using petsc-3.4 you can
> >>
> >> cd src/benchmarks/streams/
> >> make MPIVersion
> >> mpiexec -n 1 ./MPIVersion
> >> mpiexec -n 2 ./MPIVersion
> >>
> >> and send all the results.
> >>
> >> Barry
> >>
> >> On May 29, 2014, at 4:06 PM, Qin Lu <[email protected]> wrote:
> >>
> >>> For now I only care about the CPU time of the PETSc subroutines.
I tried to add
> >>> PetscLogEventBegin/End and the results are consistent with the
> >>> log_summary attached in my first email.
> >>>
> >>> The CPU times of MatSetValues and MatAssemblyBegin/End in both the p1
> >>> and p2 runs are small (< 20 sec). The CPU times of PCSetUp/PCApply are
> >>> about the same between p1 and p2 (~120 sec). The CPU time of KSPSolve
> >>> for p2 (143 sec) is a little lower than p1's (176 sec), but p2 spent
> >>> more time in MatGetSubMatrices (43 sec). So the total CPU times of the
> >>> PETSc subroutines are about the same between p1 and p2 (502 sec vs.
> >>> 488 sec).
> >>>
> >>> It seems I need a more efficient parallel preconditioner. Do you have
> >>> any suggestions for that?
> >>>
> >>> Many thanks,
> >>> Qin
> >>>
> >>> ----- Original Message -----
> >>> From: Barry Smith <[email protected]>
> >>> To: Qin Lu <[email protected]>
> >>> Cc: "[email protected]" <[email protected]>
> >>> Sent: Thursday, May 29, 2014 2:12 PM
> >>> Subject: Re: [petsc-users] About parallel performance
> >>>
> >>> You need to determine where the other 80% of the time is. My guess is
> >>> that it is in setting the values into the matrix each time. Use
> >>> PetscLogEventRegister() and put a PetscLogEventBegin/End() around the
> >>> code that computes all the entries in the matrix and calls
> >>> MatSetValues() and MatAssemblyBegin/End().
> >>>
> >>> Likely the reason the linear solver does not scale better is that you
> >>> have a machine with multiple cores that share the same memory
> >>> bandwidth, and the first core is already using well over half the
> >>> memory bandwidth, so the second core cannot be fully utilized since
> >>> both cores have to wait for data to arrive from memory. If you are
> >>> using the development version of PETSc you can run make streams
> >>> NPMAX=2 from the PETSc root directory and send this to us to confirm
> >>> this.
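Barry's logging suggestion can be sketched as follows. This is only a minimal illustration of the PetscLogEventRegister/Begin/End pattern; the event name "MatFill", the function name, and the elided assembly loop are placeholders, not code from Qin's program, and it requires a PETSc build to compile:

```c
#include <petscmat.h>  /* PETSc header; needs a PETSc installation */

/* Sketch: time the whole matrix-fill phase (entry computation,
   MatSetValues(), and assembly) under one custom log event so it
   shows up as its own line in -log_summary. */
PetscErrorCode FillMatrix(Mat A)
{
  PetscLogEvent  MAT_Fill;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogEventRegister("MatFill", MAT_CLASSID, &MAT_Fill);CHKERRQ(ierr);
  ierr = PetscLogEventBegin(MAT_Fill, 0, 0, 0, 0);CHKERRQ(ierr);

  /* ... compute entries and call MatSetValues() here ... */

  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = PetscLogEventEnd(MAT_Fill, 0, 0, 0, 0);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```

With this in place, running with -log_summary reports the "MatFill" event's time and flop counts separately, which is how one confirms where the other 80% of the time goes.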
> >>> Barry
> >>>
> >>> On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I implemented the PETSc parallel linear solver in a program; the
> >>>> implementation is basically the same as
> >>>> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the
> >>>> MatMPIAIJ and let PETSc partition the matrix through
> >>>> MatGetOwnershipRange. However, a few tests show the parallel solver
> >>>> is always a little slower than the serial solver (I have excluded
> >>>> the matrix-generation CPU time).
> >>>>
> >>>> For the serial run I used PCILU as the preconditioner; for the
> >>>> parallel run, I used ASM with ILU(0) on each subblock (-sub_pc_type
> >>>> ilu -sub_ksp_type preonly -ksp_type bcgs -pc_type asm). The number
> >>>> of unknowns is around 200,000.
> >>>>
> >>>> I have used -log_summary to print out the performance summary as
> >>>> attached (log_summary_p1 for the serial run and log_summary_p2 for
> >>>> the run with 2 processes). It seems KSPSolve accounts for less than
> >>>> 20% of Global %T.
> >>>>
> >>>> My questions are:
> >>>>
> >>>> 1. What is the bottleneck of the parallel run according to the
> >>>>    summary?
> >>>> 2. Do you have any suggestions to improve the parallel performance?
> >>>>
> >>>> Thanks a lot for your suggestions!
> >>>>
> >>>> Regards,
> >>>> Qin
> >>>> <log_summary_p1.txt><log_summary_p2.txt>
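The serial ILU vs. parallel ASM+ILU(0) setup Qin describes needs no code changes when the solver is configured through the options database, as in ex2.c. A hedged sketch of the relevant calls (PETSc 3.4-era API; the function name and argument names are illustrative, not Qin's actual code):

```c
#include <petscksp.h>  /* PETSc header; needs a PETSc installation */

/* Sketch: solve Ax = b with the solver chosen at run time, so the same
   code runs serial (-pc_type ilu) or parallel
   (-ksp_type bcgs -pc_type asm -sub_pc_type ilu -sub_ksp_type preonly). */
PetscErrorCode Solve(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  /* PETSc 3.4 signature; the MatStructure argument was dropped in 3.5 */
  ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* reads -ksp_type, -pc_type, ... */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```

Because everything goes through KSPSetFromOptions(), trying a different parallel preconditioner (e.g. -pc_type hypre or -pc_type gamg, if those packages are available) is a command-line change rather than a code change.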
