Is this determined by how the machine was built (which I cannot do anything about), or by how the MPI/message-passing is configured on the cluster (which I can ask the IT people to modify)? This machine is actually a node of a Linux cluster.

Thanks,
Qin
________________________________
From: Matthew Knepley <[email protected]>
To: Qin Lu <[email protected]>
Cc: Barry Smith <[email protected]>; petsc-users <[email protected]>
Sent: Thursday, May 29, 2014 5:27 PM
Subject: Re: [petsc-users] About parallel performance

On Thu, May 29, 2014 at 5:15 PM, Qin Lu <[email protected]> wrote:

> Barry,
>
> How did you read the test results? For a machine good for parallelism, should
> the data for np=2 be about half of those for np=1?

Ideally, the numbers should be about twice as big for np = 2.

> The machine has very new Intel chips and is very fast for serial runs. What may
> cause the bad parallelism - the configuration of the machine, or an MPI
> library (MPICH2) that was not built correctly?

The cause is the machine architecture. The memory bandwidth is only sufficient
for one core.

  Thanks,

     Matt

> Many thanks,
> Qin
>
> ----- Original Message -----
> From: Barry Smith <[email protected]>
> To: Qin Lu <[email protected]>; petsc-users <[email protected]>
> Cc:
> Sent: Thursday, May 29, 2014 4:54 PM
> Subject: Re: [petsc-users] About parallel performance
>
>    In that PETSc version BasicVersion is actually the MPI streams benchmark, so
> you ran the right thing. Your machine is totally worthless for sparse linear
> algebra parallelism. The entire memory bandwidth is used by the first core, so
> adding the second core to the computation gives you no improvement at all in
> the streams benchmark.
>
>    But the single-core memory bandwidth is pretty good, so for problems that
> don't need parallelism you should get good performance.
>
>    Barry
>
> On May 29, 2014, at 4:37 PM, Qin Lu <[email protected]> wrote:
>
>> Barry,
>>
>> I have PETSc-3.4.2 and I didn't see MPIVersion there; do you mean
>> BasicVersion? I built and ran it (if you did mean MPIVersion, I will get
>> PETSc-3.4 later):
>>
>> =================
>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 1 ./BasicVersion
>> Number of MPI processes 1
>> Function      Rate (MB/s)
>> Copy:         21682.9932
>> Scale:        21637.5509
>> Add:          21583.0395
>> Triad:        21504.6563
>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 2 ./BasicVersion
>> Number of MPI processes 2
>> Function      Rate (MB/s)
>> Copy:         21369.6976
>> Scale:        21632.3203
>> Add:          22203.7107
>> Triad:        22305.1841
>> =======================
>>
>> Thanks a lot,
>> Qin
>>
>> From: Barry Smith <[email protected]>
>> To: Qin Lu <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Sent: Thursday, May 29, 2014 4:17 PM
>> Subject: Re: [petsc-users] About parallel performance
>>
>>    You need to run the streams benchmark with one and two processes to see
>> how the memory bandwidth changes. If you are using petsc-3.4 you can
>>
>>    cd src/benchmarks/streams/
>>    make MPIVersion
>>    mpiexec -n 1 ./MPIVersion
>>    mpiexec -n 2 ./MPIVersion
>>
>> and send all the results.
>>
>>    Barry
>>
>> On May 29, 2014, at 4:06 PM, Qin Lu <[email protected]> wrote:
>>
>>> For now I only care about the CPU time of the PETSc subroutines. I tried to
>>> add PetscLogEventBegin/End and the results are consistent with the
>>> log_summary attached in my first email.
>>>
>>> The CPU time of MatSetValues and MatAssemblyBegin/End is small for both the
>>> p1 and p2 runs (< 20 sec). The CPU time of PCSetUp/PCApply is about the same
>>> between p1 and p2 (~120 sec). The KSPSolve of p2 (143 sec) is a little
>>> faster than p1's (176 sec), but p2 spent more time in MatGetSubMatrice
>>> (43 sec). So the total CPU time of the PETSc subroutines is about the same
>>> between p1 and p2 (502 sec vs. 488 sec).
>>>
>>> It seems I need a more efficient parallel preconditioner. Do you have any
>>> suggestions for that?
>>>
>>> Many thanks,
>>> Qin
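A minimal sketch of this kind of event instrumentation, assuming the standard PETSc logging API (PetscLogEventRegister and PetscLogEventBegin/End, as Barry suggests below); the "MatFill" event name and the empty fill loop are placeholders, not the actual application code:

#include <petscksp.h>

static PetscLogEvent MAT_FILL;

/* Call once, e.g. right after PetscInitialize(); MAT_CLASSID groups the
   event with the other Mat events in the -log_summary table. */
static PetscErrorCode RegisterEvents(void)
{
  return PetscLogEventRegister("MatFill", MAT_CLASSID, &MAT_FILL);
}

/* Wrap the matrix fill and assembly in the event so -log_summary reports
   their combined time (and parallel load balance) as a separate line. */
static PetscErrorCode FillAndAssemble(Mat A)
{
  PetscErrorCode ierr;
  PetscInt       Istart, Iend, i;

  ierr = PetscLogEventBegin(MAT_FILL, 0, 0, 0, 0);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {
    /* ... compute this row's entries and call MatSetValues(A, ...) ... */
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = PetscLogEventEnd(MAT_FILL, 0, 0, 0, 0);CHKERRQ(ierr);
  return 0;
}

With this in place, -log_summary prints a "MatFill" line of its own, which makes it easy to see how much of the unaccounted 80% is matrix generation rather than solve time.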
>>> ----- Original Message -----
>>> From: Barry Smith <[email protected]>
>>> To: Qin Lu <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Sent: Thursday, May 29, 2014 2:12 PM
>>> Subject: Re: [petsc-users] About parallel performance
>>>
>>>    You need to determine where the other 80% of the time is. My guess is
>>> that it is in setting the values into the matrix each time. Use
>>> PetscLogEventRegister() and put a PetscLogEventBegin/End() around the code
>>> that computes all the entries in the matrix and calls MatSetValues() and
>>> MatAssemblyBegin/End().
>>>
>>>    Likely the reason the linear solver does not scale better is that you
>>> have a machine with multiple cores that share the same memory bandwidth, and
>>> the first core is already using well over half of the memory bandwidth, so
>>> the second core cannot be fully utilized since both cores have to wait for
>>> data to arrive from memory. If you are using the development version of
>>> PETSc you can run "make streams NPMAX=2" from the PETSc root directory and
>>> send this to us to confirm it.
>>>
>>>    Barry
>>>
>>> On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I implemented the PETSc parallel linear solver in a program; the
>>>> implementation is basically the same as
>>>> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the MatMPIAIJ
>>>> and let PETSc partition the matrix through MatGetOwnershipRange. However, a
>>>> few tests show that the parallel solver is always a little slower than the
>>>> serial solver (I have excluded the matrix-generation CPU time).
>>>>
>>>> For the serial run I used PCILU as the preconditioner; for the parallel run
>>>> I used ASM with ILU(0) on each subblock (-sub_pc_type ilu -sub_ksp_type
>>>> preonly -ksp_type bcgs -pc_type asm). The number of unknowns is around
>>>> 200,000.
>>>>
>>>> I have used -log_summary to print out the performance summary as attached
>>>> (log_summary_p1 for the serial run and log_summary_p2 for the run with 2
>>>> processes). It seems KSPSolve accounts for less than 20% of Global %T.
>>>>
>>>> My questions are:
>>>>
>>>> 1. What is the bottleneck of the parallel run according to the summary?
>>>> 2. Do you have any suggestions to improve the parallel performance?
>>>>
>>>> Thanks a lot for your suggestions!
>>>>
>>>> Regards,
>>>> Qin
>>>> <log_summary_p1.txt><log_summary_p2.txt>

-- 
What most experimenters take for granted before they begin their experiments is
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
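For reference, a self-contained sketch of the setup described in the original message at the bottom of this thread, following src/ksp/ksp/examples/tutorials/ex2.c: preallocate the MatMPIAIJ, let PETSc choose the row partitioning via MatGetOwnershipRange, and pick the solver on the command line. The 1-D Laplacian stencil and the problem size are placeholders for the application's actual matrix, and the calls assume the PETSc 3.4 API:

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PetscInt       n = 200000, Istart, Iend, i;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);

  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  /* Preallocate before filling: 3 nonzeros per row for the stencil below */
  ierr = MatMPIAIJSetPreallocation(A, 3, NULL, 1, NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A, 3, NULL);CHKERRQ(ierr);

  /* PETSc picks the row partitioning; each rank fills only its own rows */
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {
    PetscScalar v[3];
    PetscInt    cols[3], first, ncols;
    v[0] = -1.0; v[1] = 2.0; v[2] = -1.0;
    cols[0] = i - 1; cols[1] = i; cols[2] = i + 1;
    first = (i == 0) ? 1 : 0;                       /* no column -1 in row 0   */
    ncols = 3 - first - ((i == n - 1) ? 1 : 0);     /* no column n in last row */
    ierr = MatSetValues(A, 1, &i, ncols, cols + first, v + first, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatGetVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr); /* -ksp_type/-pc_type read here */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

Run as in the thread, e.g.:

mpiexec -n 2 ./sketch -ksp_type bcgs -pc_type asm -sub_pc_type ilu -sub_ksp_type preonly -log_summary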
