Thanks for the helpful response, Jed. I was not aware of the possibility of running separate PETSC_COMM_WORLDs in the same program; at least, this is not clear from the documentation (e.g. http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-dev/docs/manualpages/Sys/PetscInitialize.html). I'll probably still try this out, just out of curiosity.
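If I read the PetscInitialize manual page correctly, the trick would be to split MPI_COMM_WORLD first and then assign each group's communicator to PETSC_COMM_WORLD before calling PetscInitialize. Something like the untested sketch below is what I have in mind; the per-rank split and the empty solve section are just placeholders:

#include <petscsys.h>

int main(int argc, char **argv)
{
  MPI_Comm       subcomm;
  int            rank;
  PetscErrorCode ierr;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* color = rank gives one single-process group per core; a coarser
     grouping could use e.g. rank/12 as the color instead */
  MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &subcomm);

  PETSC_COMM_WORLD = subcomm;  /* must happen before PetscInitialize */
  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* ... set up and solve one independent copy of the problem here ... */

  ierr = PetscFinalize(); if (ierr) return ierr;
  MPI_Comm_free(&subcomm);
  MPI_Finalize();
  return 0;
}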
About presenting scaling results, the most appealing option to me seems to be two graphs: one with intra-node scaling (1-12) and the other going upwards from there (12, 24, 36, ...).

Leo

2010/9/21 Jed Brown <jed at 59a2.org>

> On Tue, Sep 21, 2010 at 10:41, Leo van Kampenhout
> <lvankampenhout at gmail.com> wrote:
> > At the cluster I am currently working on, each node is made up of 12
> > PEs and has shared memory. When I reserve just 1 PE for my job, the
> > other 11 processors are given to other users, which puts a dynamic load
> > on the memory system and results in inaccurate timings. The solve times
> > I get range between 1 and 5 minutes. For me, this is not very
> > scientific either.
>
> First, shared memory and especially NUMA architectures are very
> difficult to draw meaningful intra-node scalability conclusions on.
> If at all possible, try to compare inter-node scalability instead,
> since it is a far more reliable estimate and less
> architecture-dependent (provided the network is decent). That said,
> you should be looking for reproducibility much more than "good"
> scaling. It's well known that intra-node memory contention is a major
> issue; the STREAM benchmarks actually show _lower_ total bandwidth
> when running on all 6 cores per socket with Istanbul than when using
> only 4 (and 2 cores is within a few percent).
>
> > The second idea was to reserve all 12 PEs on the node and let just 1 PE
> > run the job. However, in this way the single CPU gets all the memory
> > bandwidth and has no waiting time, therefore giving very fast results.
> > When I calculate speedup from there, the algorithm does not scale very
> > well.
>
> I say just do this and live with the poor intra-node scaling numbers.
> Some architectures actually scale memory within the node (e.g.
> BlueGene), but most don't. People expect to see the memory bottleneck
> in these results; it's nothing to be ashamed of.
>
> > Another idea would be to spawn 12 identical jobs on 12 PEs and take the
> > average runtime. Unfortunately, there is only one PETSC_COMM_WORLD, so I
> > think this is impossible to do from within one program (MPI_COMM_WORLD).
>
> You could split MPI_COMM_WORLD and run a separate PETSC_COMM_WORLD on
> each group, but I think this option will not be reproducible (the
> instances will be slightly out of sync, so memory and communication
> bottlenecks will be loaded in different ways on subsequent runs), and
> it is a bit disingenuous because this is not a configuration that you
> would ever run in practice.
>
> Jed
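P.S. In case it is useful: for the timing itself, I would probably repeat the solve a few times with a barrier in front (so all processes start the clock together) and report the minimum, roughly as in the untested fragment below. Here ksp, b and x stand for an already assembled solver and vectors, and error checking is omitted for brevity:

/* time only the solve; repeat and keep the fastest run */
double tmin = 1e30;
for (int i = 0; i < 5; i++) {
  MPI_Barrier(PETSC_COMM_WORLD);
  double t0 = MPI_Wtime();
  KSPSolve(ksp, b, x);
  double t1 = MPI_Wtime();
  if (t1 - t0 < tmin) tmin = t1 - t0;
}
PetscPrintf(PETSC_COMM_WORLD, "min solve time: %g s\n", tmin);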
