Well, somehow the imbalance comes up in your application run - but not
in the test example. It is possible that the application stresses your
machine/memory subsystem a lot more than the test code does.
Your machine has NUMA [Non-Uniform Memory Access] - so some messages
are local [if the memory is local] - and others can take at least 3
hops through the AMD memory/HyperTransport network. I was assuming the
delays due to multiple hops might show up in the test runs I requested
[but they do not]. So perhaps these multiple hops cause delays only
when the memory network gets stressed - as with your application?

http://www.thg.ru/cpu/20040929/images/opteron_8way.gif

I guess we'll just have to use your app to benchmark.

Earlier I suggested using the latest MPICH with '--device=ch3:sshm'.
Another option to try is '--with-device=ch3:nemesis'. To do these
experiments you can build different versions of PETSc [so that you can
switch between them all], i.e. use a different value of PETSC_ARCH for
each build - see the first sketch at the end of this message.

It is possible that some of the load imbalance happens before the
communication stages - but it is visible only in the scatter stage [in
-log_summary]. To get a better idea of this we'll need a barrier in
VecScatterBegin(). Not sure how to do this inside PETSc [the second
sketch at the end of this message shows the basic idea]. Barry: does
-log_sync add a barrier in VecScatter?

Also - can you confirm that no one else / no other application is
using this machine when you perform these measurement runs?

Satish

On Sat, 10 Feb 2007, Shi Jin wrote:

> Furthermore, I did a multi-process test on the SMP.
>
> petscmpirun -n 3 taskset -c 0,2,4 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> Average time for MPI_Barrier(): 4.19617e-06
> Average time for zero size MPI_Send(): 3.65575e-06
>
> petscmpirun -n 4 taskset -c 0,2,4,6 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> Average time for MPI_Barrier(): 1.75953e-05
> Average time for zero size MPI_Send(): 2.44975e-05
>
> petscmpirun -n 5 taskset -c 0,2,4,6,8 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> Average time for MPI_Barrier(): 4.22001e-05
> Average time for zero size MPI_Send(): 2.54154e-05
>
> petscmpirun -n 6 taskset -c 0,2,4,6,8,10 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> Average time for MPI_Barrier(): 4.87804e-05
> Average time for zero size MPI_Send(): 1.83185e-05
>
> petscmpirun -n 7 taskset -c 0,2,4,6,8,10,12 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> Average time for MPI_Barrier(): 2.37942e-05
> Average time for zero size MPI_Send(): 5.00679e-06
>
> petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> Average time for MPI_Barrier(): 1.35899e-05
> Average time for zero size MPI_Send(): 6.73532e-06
>
> They all seem quite fast.
> Shi
>
> --- Shi Jin <jinzishuai at yahoo.com> wrote:
>
> > Yes. The results follow.
> >
> > --- Satish Balay <balay at mcs.anl.gov> wrote:
> >
> > > Can you send the output from the following runs. You can do this
> > > with src/ksp/ksp/examples/tutorials/ex2.c - to keep things simple.
> > >
> > > petscmpirun -n 2 taskset -c 0,2 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> >
> > Average time for MPI_Barrier(): 1.81198e-06
> > Average time for zero size MPI_Send(): 5.00679e-06
> >
> > > petscmpirun -n 2 taskset -c 0,4 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> >
> > Average time for MPI_Barrier(): 2.00272e-06
> > Average time for zero size MPI_Send(): 4.05312e-06
> >
> > > petscmpirun -n 2 taskset -c 0,6 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> >
> > Average time for MPI_Barrier(): 1.7643e-06
> > Average time for zero size MPI_Send(): 4.05312e-06
> >
> > > petscmpirun -n 2 taskset -c 0,8 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> >
> > Average time for MPI_Barrier(): 2.00272e-06
> > Average time for zero size MPI_Send(): 4.05312e-06
> >
> > > petscmpirun -n 2 taskset -c 0,12 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> >
> > Average time for MPI_Barrier(): 1.57356e-06
> > Average time for zero size MPI_Send(): 5.48363e-06
> >
> > > petscmpirun -n 2 taskset -c 0,14 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> >
> > Average time for MPI_Barrier(): 2.00272e-06
> > Average time for zero size MPI_Send(): 4.52995e-06
> >
> > I also did
> >
> > petscmpirun -n 2 taskset -c 0,10 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\)
> > Average time for MPI_Barrier(): 5.00679e-06
> > Average time for zero size MPI_Send(): 3.93391e-06
> >
> > The results are not so different from each other. Also please note,
> > the timing is not exact; sometimes I got O(1e-5) timings for all
> > cases. I assume these numbers are pretty good, right? Does it
> > indicate that the MPI communication on an SMP machine is very fast?
> > I will do a similar test on a cluster and report it back to the
> > list.
> >
> > Shi
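
[Sketch 1] To make the PETSC_ARCH suggestion concrete, the two builds
could look something like the following. The install prefixes and arch
names here are made up, and the exact option spellings should be
checked against the configure docs for your MPICH and PETSc versions:

  # build MPICH2 twice, once per channel
  ./configure --prefix=$HOME/mpich-sshm --with-device=ch3:sshm && make && make install
  ./configure --prefix=$HOME/mpich-nemesis --with-device=ch3:nemesis && make && make install

  # build PETSc once against each MPI, with a distinct PETSC_ARCH
  ./config/configure.py PETSC_ARCH=linux-sshm --with-mpi-dir=$HOME/mpich-sshm
  make PETSC_ARCH=linux-sshm all
  ./config/configure.py PETSC_ARCH=linux-nemesis --with-mpi-dir=$HOME/mpich-nemesis
  make PETSC_ARCH=linux-nemesis all

Switching between the builds is then just a matter of setting
PETSC_ARCH when compiling and linking the application.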
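[Sketch 2] On the barrier question - whatever -log_sync turns out to
do, the effect can be checked directly in the application: put an
MPI_Barrier() just before the scatter and time the two phases
separately. A minimal standalone sketch of the pattern in plain MPI
follows; the usleep() and the MPI_Bcast() are stand-ins for the real
uneven computation and the VecScatterBegin/End calls:

  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
    int rank, buf;
    double t0, t1, t2;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    usleep(1000 * (rank + 1));    /* stand-in for uneven local work  */
    MPI_Barrier(MPI_COMM_WORLD);  /* the imbalance is absorbed here, */
    t1 = MPI_Wtime();             /* not charged to the next phase   */

    buf = rank;                   /* this phase now times only the   */
    MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD); /* communication */
    t2 = MPI_Wtime();

    printf("[%d] work+wait %g s, comm %g s\n", rank, t1 - t0, t2 - t1);
    MPI_Finalize();
    return 0;
  }

Without the barrier, ranks that finish their local work early charge
the waiting time to the communication phase - which is exactly the
ambiguity in the scatter numbers from -log_summary.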