Ah! One of those files *was* from a different run. Here are the correct
files for comparison (this time compiled with -O3). Sorry for the confusion.

Cheers
Gerard

Mark F. Adams emailed the following on 17/03/12 15:13:
> The two tests are a bit different: 1) one has "mystage" and 2) one does about 
> 50% more iterations for some reason.  It would make me feel better if the 
> semantics of both programs were the same.  Why are you getting more 
> iterations with pure MPI?
>
> That said, looking at the flop rates, the output does not seem to match your 
> plots:  the output files have almost the same flop rates for MatMult (but 
> there is a load balance of 1.4 for pure MPI, so I would look at the 
> partitioning of the problem and try to get perfect load balance, which should 
> not be hard, I would think ...).  KSPSolve is degraded by the vector 
> operations (AXPY, norms, VecPointwiseMult, etc.) in pure OpenMP.  This is 
> counterintuitive: there is no communication in these routines and they are 
> dead simple.  I would verify this carefully with the exact same test and, if 
> it holds up, dig in with perf tools.
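>
> To make the perf suggestion concrete, a rough sketch of what that could look 
> like (the binary name, matrix file and flags below are illustrative 
> placeholders, not the exact commands used in these runs):
>
>   # record call graphs for the threaded run so the Vec kernels show up
>   OMP_NUM_THREADS=32 perf record -g -- ./ex6 -f pressure-matrix.dat \
>       -ksp_type cg -pc_type jacobi -log_summary
>   # list the hottest symbols; look for VecAXPY / VecNorm / VecPointwiseMult
>   perf report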
>
> Mark
>
> On Mar 17, 2012, at 4:33 AM, Gerard Gorman wrote:
>
>> Hi
>>
>> We have profiled on Cray compute nodes with two 16-core AMD Opteron
>> 2.3GHz Interlagos processors, using the same matrix but this time with
>> -ksp_type cg and -pc_type jacobi. Attached are the logs for the 32 MPI
>> process test and the 32 OpenMP thread test.
>>
>> Most of the time is in stage 2. As seen previously, MatMult is
>> performing well, but the overall performance of KSPSolve drops for
>> OpenMP. I have attached a plot of the ratio (hybrid MPI+OpenMP time)/(pure
>> OpenMP time), where all 32 cores are always in use. The graph shows that
>> we always get better performance in MatMult with pure OpenMP, but there
>> is something additional in KSPSolve that degrades the OpenMP
>> performance.
>>
>> So far we have profiled with oprofile, measuring the event
>> CPU_CLK_UNHALTED, but this has not revealed the bottleneck, so more
>> digging is required.
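>>
>> Roughly, the kind of oprofile invocation meant here (a sketch using legacy
>> opcontrol syntax; the sampling count, matrix file name and exact commands
>> are assumptions for illustration, not what we actually ran):
>>
>>   opcontrol --no-vmlinux --event=CPU_CLK_UNHALTED:500000
>>   opcontrol --start
>>   OMP_NUM_THREADS=32 ./ex6 -f pressure-matrix.dat -ksp_type cg \
>>       -pc_type jacobi -log_summary
>>   opcontrol --stop
>>   # per-symbol breakdown for the ex6 binary
>>   opreport -l ./ex6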
>>
>> Any suggestions/comments gratefully received. 
>>
>> Cheers
>> Gerard
>>
>> Gerard Gorman emailed the following on 14/03/12 16:59:
>>> Hi
>>>
>>> Since Vec and most of Mat are now threaded, we have started to do more
>>> detailed profiling. I'm posting these initial tasters from a two-socket
>>> Intel Core Bloomfield processor system (i.e. 8 cores) to stimulate
>>> discussion.
>>>
>>> The matrix comes from a 3D lock exchange problem discretised using a
>>> continuous Galerkin finite element formulation and has about 450k
>>> degrees of freedom.
>>>
>>> I have configured the simulator (Fluidity -
>>> http://amcg.ese.ic.ac.uk/Fluidity) to dump out PETSc matrices at each
>>> solve. Each of these systems is then solved using
>>> petsc-dev/src/ksp/ksp/examples/tests/ex6, compiled with GCC 4.6.3 and
>>> --with-debugging=0.
>>>
>>> The PETSc options are:
>>> -get_total_flops -pc_type gamg -ksp_type cg -ksp_rtol 1.0e-6 -log_summary
>>>
>>> The three log files attached are for OMP_NUM_THREADS=1, OMP_NUM_THREADS=8,
>>> and a non-threaded MPI run with 8 processes for comparison.
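>>>
>>> As a sketch, each run was invoked roughly along the following lines (the
>>> -f option and the matrix file name are placeholders for illustration, not
>>> the exact command line used):
>>>
>>>   # threaded runs: vary OMP_NUM_THREADS
>>>   OMP_NUM_THREADS=8 ./ex6 -f pressure-matrix.dat -get_total_flops \
>>>       -pc_type gamg -ksp_type cg -ksp_rtol 1.0e-6 -log_summary
>>>
>>>   # pure-MPI comparison run
>>>   mpiexec -n 8 ./ex6 -f pressure-matrix.dat -get_total_flops \
>>>       -pc_type gamg -ksp_type cg -ksp_rtol 1.0e-6 -log_summary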
>>>
>>> The reason this benchmark is interesting is that it is the pressure
>>> solve, which is really stiff, and it uses GAMG as a black box.
>>>
>>> Using xxdiff to compare the logs, I think the interesting points are:
>>> - Overall OpenMP compares favourably with MPI.
>>> - OpenMP converged in two fewer iterations than MPI. Earlier I was
>>> expecting fewer iterations simply because there are no partitions to
>>> diminish the effectiveness of coarsening. I have not been following
>>> Mark's GAMG development, but it looks like repartitioning is being used
>>> to get around that issue (?). However, the biggest plus is that, because
>>> Chebyshev is used as a smoother (rather than something difficult to
>>> parallelise like SSOR), GAMG appears to scale pretty well when threaded
>>> with OpenMP (see the options sketch after this list).
>>> - Important operations like MatMult etc. perform well.
>>> - From the summary, "mystage 1" is the main section where OpenMP appears
>>> to need more work. We suffer in operations such as MatPtAP and
>>> MatTrnMatMult, which we have not got around to looking at yet.
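>>>
>>> On the Chebyshev point, the level smoother can be selected explicitly
>>> through the multigrid level options; a minimal sketch (standard PETSc
>>> option names, but the particular values are illustrative only):
>>>
>>>   # Chebyshev/Jacobi smoothing on the GAMG levels - threads well since
>>>   # it only needs MatMult and simple Vec operations
>>>   -pc_type gamg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi \
>>>       -mg_levels_ksp_max_it 2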
>>>
>>> As this is a relatively small and boring UMA machine, I have not bothered
>>> with scaling curves. We are setting the same benchmark up on 32-core
>>> Interlagos compute nodes at the moment - hopefully these will be ready
>>> by tomorrow.
>>>
>>> Comments welcome.
>>>
>>> Cheers
>>> Gerard
>>>
>> <pressure-matrix-cg-32mpi.dat><pressure-matrix-cg-jacobi-1mpi-32omp.dat><pressure-matrix-cg-hybrid_speedup.pdf>

Attachment: compare.tar.gz (application/x-gzip, 3455 bytes)
<http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20120317/980418f4/attachment.bin>
