Re: [petsc-users] About parallel performance

Qin Lu Thu, 29 May 2014 15:50:19 -0700

Barry,
 
Thanks a lot for the info! I now know what the problem was.
 
Qin


________________________________
 From: Barry Smith <[email protected]>
To: Qin Lu <[email protected]> 
Cc: petsc-users <[email protected]> 
Sent: Thursday, May 29, 2014 5:46 PM
Subject: Re: [petsc-users] About parallel performance
  


   For the parallel case a perfect machine would have twice the memory 
bandwidth when using 2 cores as opposed to 1 core. For yours it is almost 
exactly the same. The issue is not with MPI or the software. It depends on how 
many memory sockets there are and how they are shared by the various cores. As 
I said, the memory bandwidth for one core, 21,682 MB/s (about 21.7 gigabytes 
per second), is good, so it is a very good sequential machine. 

  Here are the results on my laptop:

Number of MPI processes 1
Process 0 Barrys-MacBook-Pro.local
Function      Rate (MB/s) 
Copy:        7928.7346
Scale:       8271.5103
Add:        11017.0430
Triad:      10843.9018
Number of MPI processes 2
Process 0 Barrys-MacBook-Pro.local
Process 1 Barrys-MacBook-Pro.local
Function      Rate (MB/s) 
Copy:       13513.0365
Scale:      13516.7086
Add:        15455.3952
Triad:      15562.0822
------------------------------------------------
np  speedup
1 1.0
2 1.44


Note that the memory bandwidth is much lower than on your machine, but there is 
a speedup going from one to two cores because one core cannot utilize all the 
memory bandwidth. But even with two cores my laptop will be slower on PETSc 
than one core on your machine.
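
The speedup column above is consistent with dividing the Triad rate at np 
processes by the Triad rate at one process. A minimal sketch of that 
arithmetic, assuming the Triad row is the one used (the function name 
triad_speedup is made up for illustration and is not part of PETSc):

```python
def triad_speedup(triad_rates):
    """Map each process count to rate(np) / rate(1), rounded to 2 decimals.

    triad_rates: dict of {number of MPI processes: Triad rate in MB/s}.
    Assumption: speedup is measured on the Triad row only.
    """
    base = triad_rates[1]
    return {n: round(rate / base, 2) for n, rate in sorted(triad_rates.items())}


# Triad rates (MB/s) copied from the laptop runs above.
laptop = {1: 10843.9018, 2: 15562.0822}
print(triad_speedup(laptop))  # {1: 1.0, 2: 1.44}
```

A perfect machine in the sense described earlier would give 2.0 at np=2.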

Here is the performance on a workstation we have that has multiple CPUs and 
multiple memory sockets:

Number of MPI processes 1
Process 0 es
Function      Rate (MB/s) 
Copy:       13077.8260
Scale:      12867.1966
Add:        14637.6757
Triad:      14414.4478
Number of MPI processes 2
Process 0 es
Process 1 es
Function      Rate (MB/s) 
Copy:       22663.3116
Scale:      22102.5495
Add:        25768.1550
Triad:      26076.0410
Number of MPI processes 3
Process 0 es
Process 1 es
Process 2 es
Function      Rate (MB/s) 
Copy:       27501.7610
Scale:      26971.2183
Add:        30433.3276
Triad:      31302.9396
Number of MPI processes 4
Process 0 es
Process 1 es
Process 2 es
Process 3 es
Function      Rate (MB/s) 
Copy:       29302.3183
Scale:      30165.5295
Add:        34577.3458
Triad:      35195.8067
------------------------------------------------
np  speedup
1 1.0
2 1.81
3 2.17
4 2.44

Note that one core has a lower memory bandwidth than your machine, but as I add 
more cores the memory bandwidth increases, by a factor of 2.4 at four cores.

There is nothing wrong with your machine; it is just not suited to running 
sparse linear algebra on multiple cores.
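
To put the three machines side by side, here is a rough sketch that compares 
the two-process Triad speedup using the rates reported in this thread. The 
dictionary labels and the helper name two_process_speedup are illustrative 
only, and using the Triad row as the measure is an assumption:

```python
# Triad rates (MB/s) at 1 and 2 MPI processes, copied from the runs in this thread.
triad = {
    "Qin's machine": (21504.6563, 22305.1841),
    "laptop": (10843.9018, 15562.0822),
    "workstation": (14414.4478, 26076.0410),
}


def two_process_speedup(rates):
    """Return the 2-process rate divided by the 1-process rate, rounded."""
    one, two = rates
    return round(two / one, 2)


for name, rates in triad.items():
    # A value near 2.0 means the second core found unused memory bandwidth;
    # a value near 1.0 means the first core had already saturated it.
    print(f"{name:14s} {two_process_speedup(rates):.2f}")
```

This prints roughly 1.04, 1.44 and 1.81, which is the pattern described 
above: the fastest single core gains almost nothing from a second process.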

  Barry





On May 29, 2014, at 5:15 PM, Qin Lu <[email protected]> wrote:

> Barry,
>  
> How did you read the test results? For a machine good for parallelism, should 
> the data of np=2 be about half of those of np=1?
>  
> The machine has very new Intel chips and is very good for serial runs. What may 
> cause the poor parallelism: the configuration of the machine, or am I using an 
> MPI lib (MPICH2) that was not built correctly?
> Many thanks,
> Qin
>  
> ----- Original Message -----
> From: Barry Smith <[email protected]>
> To: Qin Lu <[email protected]>; petsc-users <[email protected]>
> Cc: 
> Sent: Thursday, May 29, 2014 4:54 PM
> Subject: Re: [petsc-users] About parallel performance
> 
> 
>   In that PETSc version BasicVersion is actually the MPI streams benchmark so 
> you ran the right thing. Your machine is totally worthless for sparse linear 
> algebra parallelism. The entire memory bandwidth is used by the first core so 
> adding the second core to the computation gives you no improvement at all in 
> the streams benchmark. 
> 
>   But the single core memory bandwidth is pretty good so for problems that 
> don’t need parallelism you should get good performance.
> 
>    Barry
> 
> 
> 
> 
> On May 29, 2014, at 4:37 PM, Qin Lu <[email protected]> wrote:
> 
>> Barry,
>> 
>> I have PETSc-3.4.2 and I didn't see MPIVersion there; do you mean 
>> BasicVersion? I built and ran it (if you did mean MPIVersion, I will get 
>> PETSc-3.4 later):
>> 
>> =================
>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 1 ./BasicVersion
>> Number of MPI processes 1
>> Function      Rate (MB/s)
>> Copy:       21682.9932
>> Scale:      21637.5509
>> Add:        21583.0395
>> Triad:      21504.6563
>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 2 ./BasicVersion
>> Number of MPI processes 2
>> Function      Rate (MB/s)
>> Copy:       21369.6976
>> Scale:      21632.3203
>> Add:        22203.7107
>> Triad:      22305.1841
>> =======================
>> 
>> Thanks a lot,
>> Qin
>> 
>> From: Barry Smith <[email protected]>
>> To: Qin Lu <[email protected]> 
>> Cc: "[email protected]" <[email protected]> 
>> Sent: Thursday, May 29, 2014 4:17 PM
>> Subject: Re: [petsc-users] About parallel performance
>> 
>> 
>> 
>>    You need to run the streams benchmarks on one and two processes to see 
>> how the memory bandwidth changes. If you are using petsc-3.4 you can 
>> 
>> cd  src/benchmarks/streams/ 
>> 
>> make MPIVersion
>> 
>> mpiexec -n 1 ./MPIVersion
>> 
>> mpiexec -n 2 ./MPIVersion 
>> 
>>     and send all the results
>> 
>>     Barry
>> 
>> 
>> 
>> On May 29, 2014, at 4:06 PM, Qin Lu <[email protected]> wrote:
>> 
>>> For now I only care about the CPU time of the PETSc subroutines. I tried 
>>> adding PetscLogEventBegin/End and the results are consistent with the 
>>> log_summary attached in my first email.
>>>  
>>> The CPU times of MatSetValues and MatAssemblyBegin/End in both the p1 and 
>>> p2 runs are small (< 20 sec). The CPU times of PCSetup/PCApply are about the 
>>> same for p1 and p2 (~120 sec). KSPSolve in p2 (143 sec) is a little faster 
>>> than in p1 (176 sec), but p2 spent more time in MatGetSubMatrice (43 sec). 
>>> So the total CPU times of the PETSc subroutines are about the same for p1 
>>> and p2 (502 sec vs. 488 sec).
>>> 
>>> It seems I need a more efficient parallel preconditioner. Do you have any 
>>> suggestions for that?
>>> 
>>> Many thanks,
>>> Qin
>>> 
>>> ----- Original Message -----
>>> From: Barry Smith <[email protected]>
>>> To: Qin Lu <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Sent: Thursday, May 29, 2014 2:12 PM
>>> Subject: Re: [petsc-users] About parallel performance
>>> 
>>> 
>>>      You need to determine where the other 80% of the time is. My guess is 
>>> that it is in setting the values into the matrix each time. Use 
>>> PetscLogEventRegister() and put a PetscLogEventBegin/End() around the code 
>>> that computes all the entries in the matrix and calls MatSetValues() and 
>>> MatAssemblyBegin/End().
>>> 
>>>      Likely the reason the linear solver does not scale better is that you 
>>> have a machine with multiple cores that share the same memory bandwidth, and 
>>> the first core is already using well over half of it, so the second core 
>>> cannot be fully utilized: both cores have to wait for data to arrive from 
>>> memory.  If you are using the development version of PETSc you can run 
>>> make streams NPMAX=2 from the PETSc root directory and send the output to 
>>> us to confirm this.
>>> 
>>>      Barry
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I implemented a PETSc parallel linear solver in a program. The 
>>>> implementation is basically the same as 
>>>> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the MatMPIAIJ, 
>>>> and let PETSc partition the matrix through MatGetOwnershipRange. However, 
>>>> a few tests show the parallel solver is always a little slower than the 
>>>> serial solver (I have excluded the matrix generation CPU time).
>>>> 
>>>> For the serial run I used PCILU as the preconditioner; for the parallel 
>>>> run, I used ASM with ILU(0) on each subblock (-sub_pc_type ilu 
>>>> -sub_ksp_type preonly -ksp_type bcgs -pc_type asm). The number of unknowns 
>>>> is around 200,000.
>>>>  
>>>> I have used -log_summary to print out the performance summary as attached 
>>>> (log_summary_p1 for the serial run and log_summary_p2 for the run with 2 
>>>> processes). It seems that KSPSolve accounts for less than 20% of Global 
>>>> %T. 
>>>> My questions are:
>>>>  
>>>> 1. What is the bottleneck of the parallel run according to the summary?
>>>> 2. Do you have any suggestions to improve the parallel performance?
>>>>  
>>>> Thanks a lot for your suggestions!
>>>>  
>>>> Regards,
>>>> Qin    <log_summary_p1.txt><log_summary_p2.txt>
