Thanks a lot! I will try that.
 
Qin 
 

________________________________
 From: Matthew Knepley <[email protected]>
To: Qin Lu <[email protected]> 
Cc: Barry Smith <[email protected]>; petsc-users <[email protected]> 
Sent: Thursday, May 29, 2014 5:45 PM
Subject: Re: [petsc-users] About parallel performance
  


On Thu, May 29, 2014 at 5:40 PM, Qin Lu <[email protected]> wrote:

Is this determined by how the machine was built (which I cannot do anything 
about), or by how MPI/message-passing is configured on the cluster (which I can 
ask the IT people to modify)? This machine is actually a node of a Linux cluster. 

It is determined by how the machine was built. Your best bet for scalability is 
to use one process per node.

  Thanks,

     Matt 

 
>Thanks,
>Qin 
>
> 
> From: Matthew Knepley <[email protected]>
>To: Qin Lu <[email protected]> 
>Cc: Barry Smith <[email protected]>; petsc-users <[email protected]> 
>Sent: Thursday, May 29, 2014 5:27 PM
>Subject: Re: [petsc-users] About parallel performance
>  
>
>
>On Thu, May 29, 2014 at 5:15 PM, Qin Lu <[email protected]> wrote:
>
>Barry,
>> 
>>How did you read the test results? For a machine well suited to parallelism, 
>>should the numbers for np=2 be about half of those for np=1?
>
>
>Ideally, the numbers should be about twice as big for np = 2. 
>
> 
>>The machine has very new Intel chips and is very fast for serial runs. What may 
>>cause the bad parallelism? - the configuration of the machine, or an MPI 
>>library (MPICH2) that was not built correctly?
>> 
>
>
>The cause is machine architecture. The memory bandwidth is only sufficient for 
>one core.
>
>
>  Thanks,
>
>
>     Matt
>
>
>
>
>
>Many thanks,
>>Qin
>> 
>>----- Original Message -----
>>From: Barry Smith <[email protected]>
>>To: Qin Lu <[email protected]>; petsc-users <[email protected]>
>>Cc:
>>Sent: Thursday, May 29, 2014 4:54 PM
>>Subject: Re: [petsc-users] About parallel performance
>>
>>
>>  In that PETSc version BasicVersion is actually the MPI streams benchmark so 
>>you ran the right thing. Your machine is totally worthless for sparse linear 
>>algebra parallelism. The entire memory bandwidth is used by the first core so 
>>adding the second core to the computation gives you no improvement at all in 
>>the streams benchmark.
>>
>>  But the single core memory bandwidth is pretty good so for problems that 
>>don’t need parallelism you should get good performance.
>>
>>   Barry
>>
>>
>>
>>
>>On May 29, 2014, at 4:37 PM, Qin Lu <[email protected]> wrote:
>>
>>> Barry,
>>>
>>> I have PETSc-3.4.2 and I didn't see MPIVersion there; do you mean 
>>> BasicVersion? I built and ran it (if you did mean MPIVersion, I will get 
>>> PETSc-3.4 later):
>>>
>>> =================
>>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 1 ./BasicVersion
>>> Number of MPI processes 1
>>> Function      Rate (MB/s)
>>> Copy:       21682.9932
>>> Scale:      21637.5509
>>> Add:        21583.0395
>>> Triad:      21504.6563
>>> [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 2 ./BasicVersion
>>> Number of MPI processes 2
>>> Function      Rate (MB/s)
>>> Copy:       21369.6976
>>> Scale:      21632.3203
>>> Add:        22203.7107
>>> Triad:      22305.1841
>>> =======================
>>>
>>> Thanks a lot,
>>> Qin
>>>
>>> From: Barry Smith <[email protected]>
>>> To: Qin Lu <[email protected]>
>>> Cc: "[email protected]" <[email protected]>
>>> Sent: Thursday, May 29, 2014 4:17 PM
>>> Subject: Re: [petsc-users] About parallel performance
>>>
>>>
>>>
>>>   You need to run the streams benchmark with one and two processes to see 
>>>how the memory bandwidth changes. If you are using petsc-3.4 you can
>>>
>>> cd  src/benchmarks/streams/
>>>
>>> make MPIVersion
>>>
>>> mpiexec -n 1 ./MPIVersion
>>>
>>> mpiexec -n 2 ./MPIVersion
>>>
>>>    and send all the results
>>>
>>>    Barry
>>>
>>>
>>>
>>> On May 29, 2014, at 4:06 PM, Qin Lu <[email protected]> wrote:
>>>
>>>> For now I only care about the CPU time of the PETSc subroutines. I tried 
>>>> adding PetscLogEventBegin/End and the results are consistent with the 
>>>> log_summary attached in my first email.
>>>> 
>>>> The CPU time of MatSetValues and MatAssemblyBegin/End is small (< 20 sec) 
>>>> for both the p1 and p2 runs. The CPU time of PCSetUp/PCApply is about the 
>>>> same between p1 and p2 (~120 sec). The KSPSolve of p2 (143 sec) is a 
>>>> little faster than p1's (176 sec), but p2 spent more time in 
>>>> MatGetSubMatrice (43 sec). So the total CPU time of the PETSc subroutines 
>>>> is about the same between p1 and p2 (502 sec vs. 488 sec).
>>>>
>>>> It seems I need a more efficient parallel preconditioner. Do you have any 
>>>> suggestions for that?
>>>>
>>>> Many thanks,
>>>> Qin
>>>>
>>>> ----- Original Message -----
>>>> From: Barry Smith <[email protected]>
>>>> To: Qin Lu <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Sent: Thursday, May 29, 2014 2:12 PM
>>>> Subject: Re: [petsc-users] About parallel performance
>>>>
>>>>
>>>>     You need to determine where the other 80% of the time is. My guess is 
>>>>that it is in setting the values into the matrix each time. Use 
>>>>PetscLogEventRegister() and put a PetscLogEventBegin/End() around the code 
>>>>that computes all the entries in the matrix and calls MatSetValues() and 
>>>>MatAssemblyBegin/End().
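
[A minimal sketch of the event logging described above; the event name "MatFill" 
and the placement around the user's assembly code are illustrative, not from the 
original program:]

```c
#include <petscmat.h>

/* Register a custom event once, then bracket the assembly code with it.
   Its time and flop counts will appear as a "MatFill" row in -log_summary. */
PetscLogEvent  MAT_FillEvent;
PetscErrorCode ierr;

ierr = PetscLogEventRegister("MatFill", MAT_CLASSID, &MAT_FillEvent);CHKERRQ(ierr);

ierr = PetscLogEventBegin(MAT_FillEvent, 0, 0, 0, 0);CHKERRQ(ierr);
/* ... compute all the matrix entries, call MatSetValues(),
       then MatAssemblyBegin()/MatAssemblyEnd() ... */
ierr = PetscLogEventEnd(MAT_FillEvent, 0, 0, 0, 0);CHKERRQ(ierr);
```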
>>>>
>>>>     Likely the reason the linear solver does not scale better is that you 
>>>>have a machine with multiple cores that share the same memory bandwidth and 
>>>>the first core is already using well over half the memory bandwidth so the 
>>>>second core cannot be fully utilized since both cores have to wait for data 
>>>>to arrive from memory.  If you are using the development version of PETSc 
>>>>you can run make streams NPMAX=2 from the PETSc root directory and send 
>>>>this to us to confirm this.
>>>>
>>>>     Barry
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On May 29, 2014, at 1:23 PM, Qin Lu <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I implemented the PETSc parallel linear solver in a program; the 
>>>>> implementation is basically the same as 
>>>>> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the 
>>>>> MatMPIAIJ and let PETSc partition the matrix through 
>>>>> MatGetOwnershipRange. However, a few tests show the parallel solver is 
>>>>> always a little slower than the serial solver (I have excluded the 
>>>>> matrix-generation CPU time).
>>>>>
>>>>> For the serial run I used PCILU as the preconditioner; for the parallel 
>>>>> run, I used ASM with ILU(0) on each subblock (-sub_pc_type ilu 
>>>>> -sub_ksp_type preonly -ksp_type bcgs -pc_type asm). The number of 
>>>>> unknowns is around 200,000.
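
[For reference, the parallel configuration above can be passed entirely as 
runtime options; a sketch, where the executable name ./myapp is illustrative:]

```shell
mpiexec -n 2 ./myapp -ksp_type bcgs -pc_type asm \
    -sub_ksp_type preonly -sub_pc_type ilu -log_summary
```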
>>>>> 
>>>>> I have used -log_summary to print out the performance summary as attached 
>>>>> (log_summary_p1 for the serial run and log_summary_p2 for the run with 2 
>>>>> processes). It seems KSPSolve accounts for less than 20% of Global %T.
>>>>> My questions are:
>>>>> 
>>>>> 1. What is the bottleneck of the parallel run according to the summary?
>>>>> 2. Do you have any suggestions to improve the parallel performance?
>>>>> 
>>>>> Thanks a lot for your suggestions!
>>>>> 
>>>>> Regards,
>>>>> Qin    <log_summary_p1.txt><log_summary_p2.txt>     
>>
>
>
>



-- 
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener 
