Ah! One of those files *was* from a different run. Here are the correct files for comparison (this time compiled with -O3). Sorry for the confusion.
Cheers
Gerard

Mark F. Adams emailed the following on 17/03/12 15:13:
> The two tests are a bit different: 1) one has "mystage" and 2) one does
> about 50% more iterations for some reason. It would make me feel better if
> the semantics of both programs were the same. Why are you getting more
> iterations with pure MPI?
>
> That said, looking at the flop rates, the output does not seem to match
> your plots: the output files have almost the same flop rates for MatMult
> (but there is a load-imbalance ratio of 1.4 for pure MPI, so I would look
> at the partitioning of the problem and try to get perfect load balance,
> which should not be hard, I would think ...). KSPSolve is degraded by the
> vector operations (AXPY, norms, VecPointwiseMult, etc.) in pure OpenMP.
> This is counterintuitive: there is no communication in these routines and
> they are dead simple. I would verify this carefully with the exact same
> test and, if it holds up, dig in with perf tools.
>
> Mark
>
> On Mar 17, 2012, at 4:33 AM, Gerard Gorman wrote:
>
>> Hi
>>
>> We have profiled on Cray compute nodes with two 16-core AMD Opteron
>> 2.3GHz Interlagos processors, using the same matrix but this time with
>> -ksp_type cg and -pc_type jacobi. Attached are the logs for the 32 MPI
>> processes and the 32 OpenMP threads tests.
>>
>> Most of the time is in stage 2. As seen previously, MatMult is
>> performing well, but the overall performance of KSPSolve drops for
>> OpenMP. I have attached a plot of (hybrid MPI+OpenMP time)/(pure OpenMP
>> time), where all 32 cores are always used. The graph shows that we
>> always get better performance in MatMult with pure OpenMP, but there is
>> something additional in KSPSolve that degrades the OpenMP performance.
>>
>> So far we have profiled with oprofile, measuring the event
>> CPU_CLK_UNHALTED, but this has not revealed the bottleneck. So more
>> digging is required.
>>
>> Any suggestions/comments gratefully received.
>>
>> Cheers
>> Gerard
>>
>> Gerard Gorman emailed the following on 14/03/12 16:59:
>>> Hi
>>>
>>> Since Vec and most of Mat are now threaded, we have started to do more
>>> detailed profiling. I'm posting these initial tasters from a two-socket
>>> Intel Core (Bloomfield) system (i.e. 8 cores) to stimulate discussion.
>>>
>>> The matrix comes from a 3D lock-exchange problem discretised using a
>>> continuous Galerkin finite element formulation and has about 450k
>>> degrees of freedom.
>>>
>>> I have configured the simulator (Fluidity -
>>> http://amcg.ese.ic.ac.uk/Fluidity) to dump out PETSc matrices at each
>>> solve. These individual matrices are then solved using
>>> petsc-dev/src/ksp/ksp/examples/tests/ex6, compiled with GCC 4.6.3 and
>>> --with-debugging=0.
>>>
>>> The PETSc options are:
>>> -get_total_flops -pc_type gamg -ksp_type cg -ksp_rtol 1.0e-6 -log_summary
>>>
>>> The three log files attached are for OMP_NUM_THREADS=1,
>>> OMP_NUM_THREADS=8, and a non-threaded MPI run with 8 processes for
>>> comparison.
>>>
>>> The reason this benchmark is interesting is that it is the pressure
>>> solve, which is really stiff, and it uses GAMG as a black box.
>>>
>>> Using xxdiff to compare the logs, I think the interesting points are:
>>> - Overall, OpenMP compares favourably with MPI.
>>> - OpenMP converged in two fewer iterations than with MPI. Earlier I was
>>> expecting fewer iterations simply because there are no partitions to
>>> diminish the effectiveness of coarsening.
>>> I have not been following Mark's GAMG development, but it looks like
>>> repartitioning is being used to get around that issue (?). However, the
>>> biggest plus is that, because Chebyshev is used as a smoother (rather
>>> than something difficult to parallelise, like SSOR), GAMG appears to
>>> scale pretty well when threaded with OpenMP.
>>> - Important operations like MatMult etc. perform well.
>>> - From the summary, "mystage 1" is the main section where OpenMP appears
>>> to need more work. We suffer in operations such as MatPtAP and
>>> MatTrnMatMult, which we have not got around to looking at yet.
>>>
>>> As this is a relatively small and boring UMA machine, I have not
>>> bothered with scaling curves. We are setting up the same benchmark on
>>> 32-core Interlagos compute nodes at the moment - hopefully these will be
>>> ready by tomorrow.
>>>
>>> Comments welcome.
>>>
>>> Cheers
>>> Gerard
>>>
>> <pressure-matrix-cg-32mpi.dat>
>> <pressure-matrix-cg-jacobi-1mpi-32omp.dat>
>> <pressure-matrix-cg-hybrid_speedup.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: compare.tar.gz
Type: application/x-gzip
Size: 3455 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20120317/980418f4/attachment.bin>
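
To act on Mark's suggestion to verify the vector operations with the exact
same test, a microbenchmark along the lines of the sketch below would time
VecAXPY, VecNorm and VecPointwiseMult in their own log stage, so the pure-MPI
and pure-OpenMP -log_summary outputs can be compared kernel by kernel. This is
only a sketch: the vector size, loop count and launch lines are placeholders
rather than values taken from the thread.

  /* vecbench.c: a sketch (not from the original thread) that times the vector
   * kernels Mark mentions -- VecAXPY, VecNorm, VecPointwiseMult -- in their
   * own log stage, so -log_summary reports them separately per run.
   *
   * Example launches (launcher and thread settings are placeholders):
   *   OMP_NUM_THREADS=1  mpiexec -n 32 ./vecbench -log_summary    # pure MPI
   *   OMP_NUM_THREADS=32 mpiexec -n 1  ./vecbench -log_summary    # pure OpenMP
   */
  #include <petscvec.h>

  int main(int argc, char **argv)
  {
    Vec            x, y, w;
    PetscInt       i, n = 450000;   /* roughly the DOF count quoted in the thread */
    PetscReal      nrm;
    PetscLogStage  stage;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

    ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
    ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
    ierr = VecSetFromOptions(x);CHKERRQ(ierr);
    ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
    ierr = VecDuplicate(x, &w);CHKERRQ(ierr);
    ierr = VecSet(x, 1.0);CHKERRQ(ierr);
    ierr = VecSet(y, 2.0);CHKERRQ(ierr);

    /* put the kernels in a named stage so they are easy to pick out of the log */
    ierr = PetscLogStageRegister("VecKernels", &stage);CHKERRQ(ierr);
    ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
    for (i = 0; i < 1000; i++) {          /* repeat for stable timings */
      ierr = VecAXPY(y, 1.0, x);CHKERRQ(ierr);
      ierr = VecNorm(y, NORM_2, &nrm);CHKERRQ(ierr);
      ierr = VecPointwiseMult(w, x, y);CHKERRQ(ierr);
    }
    ierr = PetscLogStagePop();CHKERRQ(ierr);

    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&y);CHKERRQ(ierr);
    ierr = VecDestroy(&w);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }

Comparing the "VecKernels" stage between the 32-rank MPI run and the
1-rank/32-thread OpenMP run of this program should show whether the slowdown
Mark points at lives in the kernels themselves or elsewhere in KSPSolve.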
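
Mark's other point, the load-imbalance ratio of about 1.4 for MatMult in the
pure-MPI run, can be checked directly by loading the dumped matrix and
printing what each rank owns; a good partition would give nearly identical
counts on every rank. Again this is only a rough sketch, and the matrix file
name is a stand-in for whichever dump Fluidity actually produced.

  /* loadcheck.c: a sketch (not from the original thread) that loads the dumped
   * pressure matrix and reports per-rank row and nonzero counts, to see where
   * the ~1.4 MatMult load imbalance in the pure-MPI run comes from.
   * "pressure-matrix.dat" is a placeholder file name. */
  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Mat            A;
    PetscViewer    fd;
    MatInfo        info;
    PetscInt       rstart, rend;
    PetscMPIInt    rank;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
    ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);

    ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "pressure-matrix.dat",
                                 FILE_MODE_READ, &fd);CHKERRQ(ierr);
    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    ierr = MatLoad(A, fd);CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&fd);CHKERRQ(ierr);

    /* rows and nonzeros owned by this rank with the default layout */
    ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
    ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_SELF, "[rank %d] rows %d  nonzeros %g\n",
                       (int)rank, (int)(rend - rstart),
                       (double)info.nz_used);CHKERRQ(ierr);

    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }

Run with the same 32 ranks as the benchmark, this would show whether the
imbalance comes from the row distribution of the dumped matrix itself.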
