For now I only care about the CPU time spent in PETSc subroutines. I added 
PetscLogEventBegin/End calls, and the results are consistent with the 
log_summary output attached to my first email.
 
The CPU time of MatSetValues and MatAssemblyBegin/End is small (< 20 sec) in 
both the p1 and p2 runs. The CPU time of PCSetUp/PCApply is about the same for 
p1 and p2 (~120 sec). KSPSolve takes a little less time in p2 (143 sec) than in 
p1 (176 sec), but p2 spends more time in MatGetSubMatrices (43 sec). So the 
total CPU time of the PETSc subroutines is about the same for p1 and p2 
(502 sec vs. 488 sec).
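
To put the timings above in terms of speedup and parallel efficiency, here is a 
minimal C sketch. The helper names (speedup, efficiency) are made up for this 
illustration and are not PETSc functions. It also assumes that the totals are 
listed in the order p1 = 502 sec and p2 = 488 sec.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of a timed phase: serial time divided by parallel time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of processes. */
double efficiency(double t_serial, double t_parallel, int nprocs)
{
    assert(nprocs > 0);
    return speedup(t_serial, t_parallel) / (double)nprocs;
}

/* Print speedup and efficiency for the timings quoted in this message.
   Assumption: 502 sec is the p1 total and 488 sec is the p2 total. */
void print_report(void)
{
    printf("KSPSolve: speedup %.2f, efficiency %.2f\n",
           speedup(176.0, 143.0), efficiency(176.0, 143.0, 2));
    printf("Total:    speedup %.2f, efficiency %.2f\n",
           speedup(502.0, 488.0), efficiency(502.0, 488.0, 2));
}
```

Calling print_report() from main() shows a KSPSolve speedup of about 1.23 on 2 
processes, while the overall speedup is only about 1.03 under the ordering 
assumed above.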

It seems I need a more efficient parallel preconditioner. Do you have any 
suggestions for that?

Many thanks,
Qin

----- Original Message -----
From: Barry Smith <[email protected]>
To: Qin Lu <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Thursday, May 29, 2014 2:12 PM
Subject: Re: [petsc-users] About parallel performance


   You need to determine where the other 80% of the time is. My guess is that it is in 
setting the values into the matrix each time. Use PetscLogEventRegister() and 
put a PetscLogEventBegin/End() around the code that computes all the entries in 
the matrix and calls MatSetValues() and MatAssemblyBegin/End().

   Likely the reason the linear solver does not scale better is that you have a 
machine with multiple cores that share the same memory bandwidth and the first 
core is already using well over half the memory bandwidth so the second core 
cannot be fully utilized since both cores have to wait for data to arrive from 
memory.  If you are using the development version of PETSc you can run make 
streams NPMAX=2 from the PETSc root directory and send this to us to confirm 
this.
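
   To illustrate the kind of measurement involved, here is a rough single-process 
sketch of a STREAM-style triad loop in plain C. It is not the PETSc streams 
benchmark, and the function names (triad, triad_bandwidth_mbs) are made up for 
this illustration. Running two copies at the same time and comparing each rate 
against a single copy gives a crude hint of whether the cores compete for 
memory bandwidth.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Triad kernel: a[i] = b[i] + scalar * c[i]. Moves three arrays through memory. */
void triad(double *a, const double *b, const double *c, double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) a[i] = b[i] + scalar * c[i];
}

/* Estimate memory bandwidth in MB/s from the fastest of `repeats` triad passes.
   Returns -1.0 if allocation fails and 0.0 if a pass is too fast to time.
   clock() reports CPU time, which for a single-threaded memory-bound loop
   is close to wall-clock time. */
double triad_bandwidth_mbs(size_t n, int repeats)
{
    assert(n > 0 && repeats > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) { free(a); free(b); free(c); return -1.0; }

    for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        clock_t t0 = clock();
        triad(a, b, c, 3.0 + (double)r, n);
        clock_t t1 = clock();
        double elapsed = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best) best = elapsed;
    }

    /* Read a result so the stores to a[] are observably used. */
    volatile double sink = a[0] + a[n - 1];
    (void)sink;

    free(a); free(b); free(c);
    if (best <= 0.0) return 0.0;

    /* Each pass reads b and c and writes a: 3 * n doubles of traffic. */
    double bytes = 3.0 * (double)n * (double)sizeof(double);
    return bytes / best / 1.0e6;
}
```

   The array length should be large enough that the three arrays do not fit in 
cache (several million entries on most machines); otherwise the result 
reflects cache speed rather than memory bandwidth.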

   Barry





On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:

> Hello,
> 
> I implemented the PETSc parallel linear solver in a program. The 
> implementation is basically the same as 
> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the MatMPIAIJ 
> and let PETSc partition the matrix through MatGetOwnershipRange. However, a 
> few tests show that the parallel solver is always a little slower than the 
> serial solver (I have excluded the matrix generation CPU time).
> 
> For the serial run I used PCILU as the preconditioner; for the parallel run, 
> I used ASM with ILU(0) on each subblock (-sub_pc_type ilu -sub_ksp_type 
> preonly -ksp_type bcgs -pc_type asm). The number of unknowns is around 200,000.
>  
> I have used -log_summary to print out the performance summary as attached 
> (log_summary_p1 for serial run and log_summary_p2 for the run with 2 
> processes). It seems KSPSolve accounts for less than 20% of Global %T. 
> My questions are:
>  
> 1. What is the bottleneck of the parallel run according to the summary?
> 2. Do you have any suggestions to improve the parallel performance?
>  
> Thanks a lot for your suggestions!
>  
> Regards,
> Qin    <log_summary_p1.txt><log_summary_p2.txt>
