You need to run the streams benchmarks on one and two processes to see how the memory bandwidth changes. If you are using petsc-3.4 you can

cd src/benchmarks/streams/
make MPIVersion
mpiexec -n 1 ./MPIVersion
mpiexec -n 2 ./MPIVersion

and send all the results.

Barry

On May 29, 2014, at 4:06 PM, Qin Lu <[email protected]> wrote:

> For now I only care about the CPU time of PETSc subroutines. I tried to add
> PetscLogEventBegin/End and the results are consistent with the log_summary
> attached in my first email.
>
> The CPU time of MatSetValues and MatAssemblyBegin/End in both the p1 and p2
> runs is small (< 20 sec). The CPU time of PCSetup/PCApply is about the same
> between p1 and p2 (~120 sec). KSPSolve in p2 (143 sec) is a little faster
> than in p1 (176 sec), but p2 spent more time in MatGetSubMatrice (43 sec). So
> the total CPU time of PETSc subroutines is about the same between p1 and p2
> (502 sec vs. 488 sec).
>
> It seems I need a more efficient parallel preconditioner. Do you have any
> suggestions for that?
>
> Many thanks,
> Qin
>
> ----- Original Message -----
> From: Barry Smith <[email protected]>
> To: Qin Lu <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Sent: Thursday, May 29, 2014 2:12 PM
> Subject: Re: [petsc-users] About parallel performance
>
> You need to determine where the other 80% of the time is. My guess is that
> it is in setting the values into the matrix each time. Use
> PetscLogEventRegister() and put a PetscLogEventBegin/End() around the code
> that computes all the entries in the matrix and calls MatSetValues() and
> MatAssemblyBegin/End().
>
> Likely the reason the linear solver does not scale better is that you have
> a machine with multiple cores that share the same memory bandwidth, and the
> first core is already using well over half the memory bandwidth, so the
> second core cannot be fully utilized since both cores have to wait for data
> to arrive from memory. If you are using the development version of PETSc you
> can run make streams NPMAX=2 from the PETSc root directory and send the
> output to us to confirm this.
>
> Barry
>
> On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:
>
>> Hello,
>>
>> I implemented a PETSc parallel linear solver in a program. The
>> implementation is basically the same as
>> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the MatMPIAIJ
>> and let PETSc partition the matrix through MatGetOwnershipRange. However,
>> a few tests show the parallel solver is always a little slower than the
>> serial solver (I have excluded the matrix generation CPU time).
>>
>> For the serial run I used PCILU as the preconditioner; for the parallel
>> run, I used ASM with ILU(0) on each subblock (-sub_pc_type ilu
>> -sub_ksp_type preonly -ksp_type bcgs -pc_type asm). The number of unknowns
>> is around 200,000.
>>
>> I have used -log_summary to print out the performance summary as attached
>> (log_summary_p1 for the serial run and log_summary_p2 for the run with 2
>> processes). It seems KSPSolve accounts for less than 20% of Global %T.
>>
>> My questions are:
>>
>> 1. What is the bottleneck of the parallel run according to the summary?
>> 2. Do you have any suggestions to improve the parallel performance?
>>
>> Thanks a lot for your suggestions!
>>
>> Regards,
>> Qin
<log_summary_p1.txt><log_summary_p2.txt>
