Thanks a lot! I will try that.

Qin

________________________________
From: Matthew Knepley <[email protected]>
To: Qin Lu <[email protected]>
Cc: Barry Smith <[email protected]>; petsc-users <[email protected]>
Sent: Thursday, May 29, 2014 5:45 PM
Subject: Re: [petsc-users] About parallel performance
On Thu, May 29, 2014 at 5:40 PM, Qin Lu <[email protected]> wrote:

> Is this determined by how the machine was built (which I cannot do anything
> about), or by how the MPI/message-passing is configured on the cluster (which
> I can ask the IT people to modify)? This machine is actually a node of a
> Linux cluster.

It is determined by how the machine was built. Your best bet for scalability is to use one process per node.

  Thanks,

     Matt

> Thanks,
> Qin
>
> From: Matthew Knepley <[email protected]>
> To: Qin Lu <[email protected]>
> Cc: Barry Smith <[email protected]>; petsc-users <[email protected]>
> Sent: Thursday, May 29, 2014 5:27 PM
> Subject: Re: [petsc-users] About parallel performance
>
> On Thu, May 29, 2014 at 5:15 PM, Qin Lu <[email protected]> wrote:
>
>> Barry,
>>
>> How did you read the test results? For a machine good for parallelism,
>> should the data of np=2 be about half of those of np=1?
>
> Ideally, the numbers should be about twice as big for np = 2.
>
>> The machine has very new Intel chips and is very fast for serial runs. What
>> may cause the poor parallelism - the configuration of the machine, or an MPI
>> library (MPICH2) that was not built correctly?
>
> The cause is machine architecture. The memory bandwidth is only sufficient
> for one core.
>
>   Thanks,
>
>      Matt
>
>> Many thanks,
>> Qin
>>
>> ----- Original Message -----
>> From: Barry Smith <[email protected]>
>> To: Qin Lu <[email protected]>; petsc-users <[email protected]>
>> Cc:
>> Sent: Thursday, May 29, 2014 4:54 PM
>> Subject: Re: [petsc-users] About parallel performance
>>
>> In that PETSc version BasicVersion is actually the MPI streams benchmark, so
>> you ran the right thing. Your machine is totally worthless for sparse linear
>> algebra parallelism. The entire memory bandwidth is used by the first core,
>> so adding the second core to the computation gives you no improvement at all
>> in the streams benchmark.
>>
>> But the single core memory bandwidth is pretty good, so for problems that
>> don't need parallelism you should get good performance.
>>
>>    Barry
>>
>> On May 29, 2014, at 4:37 PM, Qin Lu <[email protected]> wrote:
>>
>>> Barry,
>>>
>>> I have PETSc-3.4.2 and I didn't see MPIVersion there; do you mean
>>> BasicVersion? I built and ran it (if you did mean MPIVersion, I will get
>>> PETSc-3.4 later):
>>>
>>> =================
>>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 1 ./BasicVersion
>>> Number of MPI processes 1
>>> Function      Rate (MB/s)
>>> Copy:         21682.9932
>>> Scale:        21637.5509
>>> Add:          21583.0395
>>> Triad:        21504.6563
>>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 2 ./BasicVersion
>>> Number of MPI processes 2
>>> Function      Rate (MB/s)
>>> Copy:         21369.6976
>>> Scale:        21632.3203
>>> Add:          22203.7107
>>> Triad:        22305.1841
>>> =======================
>>>
>>> Thanks a lot,
>>> Qin
>>>
>>> From: Barry Smith <[email protected]>
>>> To: Qin Lu <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Sent: Thursday, May 29, 2014 4:17 PM
>>> Subject: Re: [petsc-users] About parallel performance
>>>
>>> You need to run the streams benchmark at one and two processes to see
>>> how the memory bandwidth changes. If you are using petsc-3.4 you can
>>>
>>>   cd src/benchmarks/streams/
>>>
>>>   make MPIVersion
>>>
>>>   mpiexec -n 1 ./MPIVersion
>>>
>>>   mpiexec -n 2 ./MPIVersion
>>>
>>> and send all the results.
>>>
>>>    Barry
>>>
>>> On May 29, 2014, at 4:06 PM, Qin Lu <[email protected]> wrote:
>>>
>>>> For now I only care about the CPU time of the PETSc subroutines. I tried
>>>> to add PetscLogEventBegin/End and the results are consistent with the
>>>> -log_summary attached in my first email.
>>>>
>>>> The CPU time of MatSetValues and MatAssemblyBegin/End in both the p1 and
>>>> p2 runs is small (< 20 sec). The CPU time of PCSetUp/PCApply is about the
>>>> same between p1 and p2 (~120 sec).
>>>> The KSPSolve CPU time of p2 (143 sec) is a little lower than p1's (176
>>>> sec), but p2 spent more time in MatGetSubMatrice (43 sec). So the total
>>>> CPU time of the PETSc subroutines is about the same between p1 and p2
>>>> (502 sec vs. 488 sec).
>>>>
>>>> It seems I need a more efficient parallel preconditioner. Do you have any
>>>> suggestions for that?
>>>>
>>>> Many thanks,
>>>> Qin
>>>>
>>>> ----- Original Message -----
>>>> From: Barry Smith <[email protected]>
>>>> To: Qin Lu <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Sent: Thursday, May 29, 2014 2:12 PM
>>>> Subject: Re: [petsc-users] About parallel performance
>>>>
>>>> You need to determine where the other 80% of the time is. My guess is that
>>>> it is in setting the values into the matrix each time. Use
>>>> PetscLogEventRegister() and put a PetscLogEventBegin/End() around the code
>>>> that computes all the entries in the matrix and calls MatSetValues() and
>>>> MatAssemblyBegin/End().
>>>>
>>>> Likely the reason the linear solver does not scale better is that you
>>>> have a machine with multiple cores that share the same memory bandwidth,
>>>> and the first core is already using well over half the memory bandwidth,
>>>> so the second core cannot be fully utilized since both cores have to wait
>>>> for data to arrive from memory. If you are using the development version
>>>> of PETSc you can run make streams NPMAX=2 from the PETSc root directory
>>>> and send this to us to confirm this.
>>>>
>>>>    Barry
>>>>
>>>> On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I implemented the PETSc parallel linear solver in a program; the
>>>>> implementation is basically the same as
>>>>> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the
>>>>> MatMPIAIJ, and let PETSc partition the matrix through
>>>>> MatGetOwnershipRange.
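The ex2-style setup Qin describes (a preallocated MPIAIJ matrix, with each rank filling the rows PETSc assigns it via MatGetOwnershipRange) might look like the following sketch. This is not Qin's actual code: the matrix size matches the thread, but the per-row nonzero estimates and the trivial diagonal fill are placeholders. It requires a PETSc installation to compile.

```c
/* Sketch of an ex2-style setup: preallocate an AIJ matrix and let PETSc
 * choose the row partition. d_nz/o_nz below are illustrative estimates
 * of nonzeros per row in the diagonal and off-diagonal blocks. */
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat      A;
    PetscInt n = 200000;          /* global unknowns, as in the thread */
    PetscInt i, Istart, Iend;
    PetscScalar v = 1.0;          /* placeholder entry */

    PetscInitialize(&argc, &argv, NULL, NULL);

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    /* Preallocation: both calls are safe; only the one matching the
     * actual matrix type takes effect. */
    MatMPIAIJSetPreallocation(A, 7, NULL, 3, NULL);
    MatSeqAIJSetPreallocation(A, 7, NULL);

    /* Each rank inserts values only for the rows it owns. */
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++)
        MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatDestroy(&A);
    PetscFinalize();
    return 0;
}
```

Getting the preallocation estimates right matters: underestimating forces mallocs during MatSetValues, which shows up as exactly the kind of unlogged assembly cost Barry asks about above.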
>>>>> However, a few tests show the parallel solver is always a little slower
>>>>> than the serial solver (I have excluded the matrix generation CPU time).
>>>>>
>>>>> For the serial run I used PCILU as the preconditioner; for the parallel
>>>>> run, I used ASM with ILU(0) on each subblock (-sub_pc_type ilu
>>>>> -sub_ksp_type preonly -ksp_type bcgs -pc_type asm). The number of
>>>>> unknowns is around 200,000.
>>>>>
>>>>> I have used -log_summary to print out the performance summary as attached
>>>>> (log_summary_p1 for the serial run and log_summary_p2 for the run with 2
>>>>> processes). It seems KSPSolve accounts for less than 20% of Global %T.
>>>>>
>>>>> My questions are:
>>>>>
>>>>> 1. What is the bottleneck of the parallel run according to the summary?
>>>>> 2. Do you have any suggestions to improve the parallel performance?
>>>>>
>>>>> Thanks a lot for your suggestions!
>>>>>
>>>>> Regards,
>>>>> Qin
>>>>> <log_summary_p1.txt><log_summary_p2.txt>
>
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
> -- Norbert Wiener

--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener
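For reference, Barry's logging suggestion from earlier in the thread (register an event with PetscLogEventRegister() and bracket the matrix fill with PetscLogEventBegin/End()) can be sketched as below. This is a minimal illustration, not the thread's actual code: the event name "MatFill", the small matrix size, and the trivial fill loop are all placeholders, and a PETSc installation is required to build it.

```c
/* Sketch of instrumenting the matrix fill with a custom log event so that
 * -log_summary reports its time and share of Global %T separately. */
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat           A;
    PetscLogEvent matfill_event;          /* placeholder event */
    PetscInt      i, Istart, Iend, n = 100;
    PetscScalar   v = 1.0;

    PetscInitialize(&argc, &argv, NULL, NULL);
    /* Register once; the name is what appears in the -log_summary table. */
    PetscLogEventRegister("MatFill", MAT_CLASSID, &matfill_event);

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetUp(A);

    /* Bracket everything that computes entries, calls MatSetValues(),
     * and runs MatAssemblyBegin/End(), per Barry's advice. */
    PetscLogEventBegin(matfill_event, 0, 0, 0, 0);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++)
        MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    PetscLogEventEnd(matfill_event, 0, 0, 0, 0);

    MatDestroy(&A);
    PetscFinalize();
    return 0;
}
```

Running with -log_summary then shows a "MatFill" row alongside KSPSolve and PCApply, which is how Qin confirmed where the remaining time was going.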
