You need to run the streams benchmarks on one and two processes to see how 
the memory bandwidth changes. If you are using petsc-3.4 you can 

 cd  src/benchmarks/streams/ 

 make MPIVersion

 mpiexec -n 1 ./MPIVersion

 mpiexec -n 2 ./MPIVersion 

   and send all the results

   Barry


On May 29, 2014, at 4:06 PM, Qin Lu <[email protected]> wrote:

> For now I only care about the CPU of PETSc subroutines. I tried to add 
> PetscLogEventBegin/End and the results are consistent with the log_summary 
> attached in my first email.
>  
> The CPU times of MatSetValues and MatAssemblyBegin/End are small (< 20 sec) 
> for both the p1 and p2 runs. The CPU of PCSetUp/PCApply is about the same 
> between p1 and p2 (~120 sec). The KSPSolve time of p2 (143 sec) is a little 
> less than p1's (176 sec), but p2 spent more time in MatGetSubMatrices (43 
> sec). So the total CPU of the PETSc subroutines is about the same between p1 
> and p2 (502 sec vs. 488 sec).
> 
> It seems I need a more efficient parallel preconditioner. Do you have any 
> suggestions for that?
> 
> Many thanks,
> Qin
> 
> ----- Original Message -----
> From: Barry Smith <[email protected]>
> To: Qin Lu <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Sent: Thursday, May 29, 2014 2:12 PM
> Subject: Re: [petsc-users] About parallel performance
> 
> 
>    You need to determine where the other 80% of the time is. My guess is that 
> it is in setting the values into the matrix each time. Use 
> PetscLogEventRegister() and put a PetscLogEventBegin/End() around the code 
> that computes all the entries in the matrix and calls MatSetValues() and 
> MatAssemblyBegin/End().
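> A minimal sketch of the logging described above (the event name "MatFill" 
> and the placement are placeholders, not code from this thread):
> 
> ```c
> #include <petsc.h>
> 
> PetscLogEvent MAT_FillEvent;
> 
> /* Register the event once, after PetscInitialize() */
> PetscLogEventRegister("MatFill", MAT_CLASSID, &MAT_FillEvent);
> 
> /* Wrap the matrix-fill phase so it appears as its own line in -log_summary */
> PetscLogEventBegin(MAT_FillEvent, 0, 0, 0, 0);
> /* ... compute entries, call MatSetValues(), MatAssemblyBegin/End() ... */
> PetscLogEventEnd(MAT_FillEvent, 0, 0, 0, 0);
> ```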
> 
>    Likely the reason the linear solver does not scale better is that you have 
> a machine with multiple cores that share the same memory bandwidth and the 
> first core is already using well over half the memory bandwidth so the second 
> core cannot be fully utilized since both cores have to wait for data to 
> arrive from memory. If you are using the development version of PETSc you 
> can run make streams NPMAX=2 from the PETSc root directory and send us the 
> output to confirm this.
> 
>    Barry
> 
> 
> 
> 
> 
> On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:
> 
>> Hello,
>> 
>> I implemented the PETSc parallel linear solver in a program; the 
>> implementation is basically the same as 
>> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the MatMPIAIJ 
>> and let PETSc partition the matrix through MatGetOwnershipRange. However, a 
>> few tests show the parallel solver is always a little slower than the 
>> serial solver (I have excluded the matrix generation CPU).
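>> As a sketch of the setup just described (the global size and the 
>> nonzeros-per-row counts are hypothetical; the actual code follows ex2.c):
>> 
>> ```c
>> Mat      A;
>> PetscInt Istart, Iend, n = 200000;  /* hypothetical global size */
>> 
>> MatCreate(PETSC_COMM_WORLD, &A);
>> MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
>> MatSetType(A, MATMPIAIJ);
>> /* Preallocation: guessed diagonal/off-diagonal nonzeros per row */
>> MatMPIAIJSetPreallocation(A, 7, NULL, 3, NULL);
>> MatGetOwnershipRange(A, &Istart, &Iend);
>> /* loop over local rows Istart..Iend-1, call MatSetValues(), then
>>    MatAssemblyBegin/End(A, MAT_FINAL_ASSEMBLY) */
>> ```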
>> 
>> For the serial run I used PCILU as the preconditioner; for the parallel 
>> run, I used ASM with ILU(0) on each subblock (-sub_pc_type ilu 
>> -sub_ksp_type preonly -ksp_type bcgs -pc_type asm). The number of unknowns 
>> is around 200,000.
>>   
>> I have used -log_summary to print out the performance summary as attached 
>> (log_summary_p1 for the serial run and log_summary_p2 for the run with 2 
>> processes). It seems KSPSolve accounts for less than 20% of Global %T.
>> My questions are:
>>   
>> 1. What is the bottleneck of the parallel run according to the summary?
>> 2. Do you have any suggestions to improve the parallel performance?
>>   
>> Thanks a lot for your suggestions!
>>   
>> Regards,
>> Qin    <log_summary_p1.txt><log_summary_p2.txt>
