Is this determined by how the machine was built (which I cannot do anything 
about), or by how the MPI/message-passing library is configured on the cluster 
(which I can ask the IT people to modify)? This machine is actually a node of a 
Linux cluster.
 
Thanks,
Qin 
 

________________________________
 From: Matthew Knepley <[email protected]>
To: Qin Lu <[email protected]> 
Cc: Barry Smith <[email protected]>; petsc-users <[email protected]> 
Sent: Thursday, May 29, 2014 5:27 PM
Subject: Re: [petsc-users] About parallel performance
  


On Thu, May 29, 2014 at 5:15 PM, Qin Lu <[email protected]> wrote:

Barry,
> 
>How did you read the test results? For a machine with good parallelism, should 
>the data for np=2 be about half of those for np=1?

Ideally, the numbers should be about twice as big for np = 2. 

 
>The machine has very new Intel chips and is very fast for serial runs. What may 
>cause the bad parallelism - the configuration of the machine, or an MPI library 
>(MPICH2) that was not built correctly?
>

The cause is the machine architecture: the memory bandwidth is sufficient for 
only one core.

  Thanks,

     Matt



Many thanks,
>Qin
> 
>----- Original Message -----
>From: Barry Smith <[email protected]>
>To: Qin Lu <[email protected]>; petsc-users <[email protected]>
>Cc:
>Sent: Thursday, May 29, 2014 4:54 PM
>Subject: Re: [petsc-users] About parallel performance
>
>
>  In that PETSc version BasicVersion is actually the MPI streams benchmark so 
>you ran the right thing. Your machine is totally worthless for sparse linear 
>algebra parallelism. The entire memory bandwidth is used by the first core so 
>adding the second core to the computation gives you no improvement at all in 
>the streams benchmark.
>
>  But the single core memory bandwidth is pretty good so for problems that 
>don’t need parallelism you should get good performance.
>
>   Barry
>
>
>
>
>On May 29, 2014, at 4:37 PM, Qin Lu <[email protected]> wrote:
>
>> Barry,
>>
>> I have PETSc-3.4.2 and I didn't see MPIVersion there; do you mean 
>> BasicVersion? I built and ran it (if you did mean MPIVersion, I will get 
>> PETSc-3.4 later):
>>
>> =================
>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 1 ./BasicVersion
>> Number of MPI processes 1
>> Function      Rate (MB/s)
>> Copy:       21682.9932
>> Scale:      21637.5509
>> Add:        21583.0395
>> Triad:      21504.6563
>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 2 ./BasicVersion
>> Number of MPI processes 2
>> Function      Rate (MB/s)
>> Copy:       21369.6976
>> Scale:      21632.3203
>> Add:        22203.7107
>> Triad:      22305.1841
>> =======================
>>
>> Thanks a lot,
>> Qin
>>
>> From: Barry Smith <[email protected]>
>> To: Qin Lu <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Sent: Thursday, May 29, 2014 4:17 PM
>> Subject: Re: [petsc-users] About parallel performance
>>
>>
>>
>>   You need to run the streams benchmark at one and two processes to see 
>>how the memory bandwidth changes. If you are using petsc-3.4 you can
>>
>> cd  src/benchmarks/streams/
>>
>> make MPIVersion
>>
>> mpiexec -n 1 ./MPIVersion
>>
>> mpiexec -n 2 ./MPIVersion
>>
>>    and send all the results
>>
>>    Barry
>>
>>
>>
>> On May 29, 2014, at 4:06 PM, Qin Lu <[email protected]> wrote:
>>
>>> For now I only care about the CPU of PETSc subroutines. I tried to add 
>>> PetscLogEventBegin/End and the results are consistent with the log_summary 
>>> attached in my first email.
>>> 
>>> The CPU time of MatSetValues and MatAssemblyBegin/End is small (< 20 sec) for 
>>> both the p1 and p2 runs. The CPU time of PCSetUp/PCApply is about the same 
>>> between p1 and p2 (~120 sec). The KSPSolve of p2 (143 sec) is a little faster 
>>> than p1's (176 sec), but p2 spent more time in MatGetSubMatrices (43 sec). 
>>> So the total CPU time of the PETSc subroutines is about the same between p1 
>>> and p2 (502 sec vs. 488 sec).
>>>
>>> It seems I need a more efficient parallel preconditioner. Do you have any 
>>> suggestions for that?
>>>
>>> Many thanks,
>>> Qin
>>>
>>> ----- Original Message -----
>>> From: Barry Smith <[email protected]>
>>> To: Qin Lu <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Sent: Thursday, May 29, 2014 2:12 PM
>>> Subject: Re: [petsc-users] About parallel performance
>>>
>>>
>>>     You need to determine where the other 80% of the time is. My guess is 
>>>that it is in setting the values into the matrix each time. Use 
>>>PetscLogEventRegister() and put a PetscLogEventBegin/End() around the code 
>>>that computes all the entries in the matrix and calls MatSetValues() and 
>>>MatAssemblyBegin/End().
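Barry's suggestion can be sketched as follows. This is a hypothetical fragment, not the poster's actual assembly code: the event name, the function, and the surrounding structure are placeholders, and it needs a PETSc installation to compile. The logged time then appears as its own row in the -log_summary output.

```c
#include <petsc.h>

/* Hypothetical user-defined event bracketing matrix assembly; the time
 * spent between Begin and End shows up under "UserMatAssembly" in the
 * -log_summary table. */
static PetscLogEvent USER_MatAssembly;

PetscErrorCode assemble_matrix(Mat A)
{
  PetscErrorCode ierr;

  ierr = PetscLogEventRegister("UserMatAssembly", MAT_CLASSID,
                               &USER_MatAssembly);CHKERRQ(ierr);
  ierr = PetscLogEventBegin(USER_MatAssembly, 0, 0, 0, 0);CHKERRQ(ierr);

  /* ... compute the entries and call MatSetValues() here ... */

  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = PetscLogEventEnd(USER_MatAssembly, 0, 0, 0, 0);CHKERRQ(ierr);
  return 0;
}
```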
>>>
>>>     Likely the reason the linear solver does not scale better is that you 
>>>have a machine with multiple cores that share the same memory bandwidth and 
>>>the first core is already using well over half the memory bandwidth so the 
>>>second core cannot be fully utilized since both cores have to wait for data 
>>>to arrive from memory.  If you are using the development version of PETSc 
>>>you can run make streams NPMAX=2 from the PETSc root directory and send this 
>>>to us to confirm this.
>>>
>>>     Barry
>>>
>>>
>>>
>>>
>>>
>>> On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I implemented the PETSc parallel linear solver in a program; the 
>>>> implementation is basically the same as 
>>>> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the MatMPIAIJ 
>>>> and let PETSc partition the matrix through MatGetOwnershipRange. However, 
>>>> a few tests show the parallel solver is always a little slower than the 
>>>> serial solver (I have excluded the matrix-generation CPU time).
>>>>
>>>> For the serial run I used PCILU as the preconditioner; for the parallel 
>>>> run, I used ASM with ILU(0) in each subblock (-sub_pc_type ilu 
>>>> -sub_ksp_type preonly -ksp_type bcgs -pc_type asm). The number of unknowns 
>>>> is around 200,000.
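The ex2.c-style setup described above can be sketched like this. It is a sketch only: the per-row nonzero estimates (7 diagonal, 3 off-diagonal) are placeholders, the assembly loop is elided, and it needs a PETSc installation to compile.

```c
#include <petsc.h>

/* Create an n-by-n MPIAIJ matrix, let PETSc choose the row partition,
 * and preallocate with estimated nonzeros per row (placeholders here). */
PetscErrorCode create_matrix(MPI_Comm comm, PetscInt n, Mat *A)
{
  PetscInt       Istart, Iend;
  PetscErrorCode ierr;

  ierr = MatCreate(comm, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(*A);CHKERRQ(ierr);
  /* d_nz=7, o_nz=3: hypothetical per-row estimates for the diagonal and
   * off-diagonal blocks of the parallel AIJ format */
  ierr = MatMPIAIJSetPreallocation(*A, 7, NULL, 3, NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(*A, 7, NULL);CHKERRQ(ierr);

  /* rows Istart..Iend-1 are owned by this process */
  ierr = MatGetOwnershipRange(*A, &Istart, &Iend);CHKERRQ(ierr);
  /* ... loop over rows Istart..Iend-1 calling MatSetValues(), then
   *     MatAssemblyBegin/End(*A, MAT_FINAL_ASSEMBLY) ... */
  return 0;
}
```

With accurate preallocation, assembly cost is dominated by computing the entries rather than by memory reallocation inside MatSetValues.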
>>>> 
>>>> I have used -log_summary to print the performance summary, as attached 
>>>> (log_summary_p1 for the serial run and log_summary_p2 for the run with 2 
>>>> processes). It seems KSPSolve accounts for less than 20% of the Global 
>>>> %T.
>>>> My questions are:
>>>> 
>>>> 1. What is the bottleneck of the parallel run according to the summary?
>>>> 2. Do you have any suggestions to improve the parallel performance?
>>>> 
>>>> Thanks a lot for your suggestions!
>>>> 
>>>> Regards,
>>>> Qin    <log_summary_p1.txt><log_summary_p2.txt>     
>


-- 
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener 
