I've built Open MPI 1.5.5rc1 (tarball from the web) with CFLAGS=-O3. Unfortunately, this also had no effect.
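(A quick way to double-check that a build is optimized rather than a debug build: ompi_info | grep -i debug should report "Internal debug support: no" for a non-debug build.)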
Here are some results with binding reports enabled:

$ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1] to cpus 0002
[n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0] to cpus 0001
latency: 1.415us

$ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1] to cpus 0002
[n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0] to cpus 0001
latency: 1.4us

$ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1] to cpus 0002
[n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0] to cpus 0001
latency: 1.4us

$ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1] to socket 0 cpus 0001
[n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0] to socket 0 cpus 0001
latency: 4.0us

$ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1] to socket 0 cpus 0001
[n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0] to socket 0 cpus 0001
latency: 4.0us

$ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
[n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1] to socket 0 cpus 0001
[n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0] to socket 0 cpus 0001
latency: 4.0us

With socket binding enabled, it seems that all ranks are bound to the very first core of one and the same socket. Is that intended? I expected each rank to get its own socket (i.e. 2 ranks -> 2 sockets)... To cross-check what --report-bindings claims, each rank can also print its own affinity mask; see the sketch below.
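A minimal sketch of such a self-check (Linux-specific, using sched_getaffinity; this is not part of the actual all2all benchmark, just an illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Query the affinity mask of the calling process. */
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("rank %d may run on cpus:", rank);
        for (i = 0; i < CPU_SETSIZE; i++)
            if (CPU_ISSET(i, &mask))
                printf(" %d", i);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with the same mpirun command lines as above, each rank then reports which cores it is actually allowed to run on, independent of what odls prints.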
Matthias

On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
> Also, double-check that you have an optimized build, not a debugging build.
>
> SVN and HG checkouts default to debugging builds, which add in lots of latency.
>
> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
> > Few thoughts:
> >
> > 1. Bind-to-socket is broken in 1.5.4 - fixed in the next release.
> >
> > 2. Add --report-bindings to the cmd line and see where it thinks the procs are bound.
> >
> > 3. Sounds like memory may not be local - might be worth checking mem binding.
> >
> > Sent from my iPad
> >
> > On Feb 13, 2012, at 7:07 AM, Matthias Jurenz <matthias.jurenz@tu-dresden.de> wrote:
> >> Hi Sylvain,
> >>
> >> thanks for the quick response!
> >>
> >> Here are some results with process binding enabled. I hope I used the parameters correctly...
> >>
> >> bind two ranks to one socket:
> >> $ mpirun -np 2 --bind-to-core ./all2all
> >> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
> >>
> >> bind two ranks to two different sockets:
> >> $ mpirun -np 2 --bind-to-socket ./all2all
> >>
> >> All three runs resulted in similarly bad latencies (~1.4us). :-(
> >>
> >> Matthias
> >>
> >> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
> >>> Hi Matthias,
> >>>
> >>> You might want to play with process binding to see if your problem is related to bad memory affinity.
> >>>
> >>> Try to launch pingpong on two CPUs of the same socket, then on different sockets (i.e. bind each process to a core, and try different configurations).
> >>>
> >>> Sylvain
> >>>
> >>> From: Matthias Jurenz <matthias.jur...@tu-dresden.de>
> >>> To: Open MPI Developers <de...@open-mpi.org>
> >>> Date: 13/02/2012 12:12
> >>> Subject: [OMPI devel] poor btl sm latency
> >>> Sent by: devel-boun...@open-mpi.org
> >>>
> >>> Hello all,
> >>>
> >>> on our new AMD cluster (AMD Opteron 6274, 2.2 GHz) we get very bad latencies (~1.5us) when performing 0-byte p2p communication on one single node using the Open MPI sm BTL. When using Platform MPI we get ~0.5us latencies, which is pretty good. The bandwidth results are similar for both MPI implementations (~3.3 GB/s) - this is okay.
> >>>
> >>> One node has 64 cores and 64 GB RAM. It doesn't matter how many ranks are allocated by the application - we get similar results with different numbers of ranks.
> >>>
> >>> We are using Open MPI 1.5.4, built with gcc 4.3.4 without any special configure options except the installation prefix and the location of the LSF stuff.
> >>>
> >>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to use /dev/shm instead of /tmp for the session directory, but it had no effect. Furthermore, we tried the current release candidate 1.5.5rc1 of Open MPI, which provides an option to use SysV shared memory (-mca shmem sysv) - this also results in similarly poor latencies.
> >>>
> >>> Do you have any idea? Please help!
> >>>
> >>> Thanks,
> >>> Matthias
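For reference, the kind of 0-byte ping-pong measurement discussed in this thread can be reproduced with a few lines of MPI. This is a rough sketch, not the actual all2all benchmark used above:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int reps = 100000;
    int rank, i;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Warm-up and timed loop: rank 0 and rank 1 bounce a 0-byte
       message back and forth; other ranks (if any) stay idle. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    /* One-way latency = round-trip time / 2. */
    if (rank == 0)
        printf("latency: %.3f us\n", (t1 - t0) / reps / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}

Run with the binding variants shown above (e.g. mpirun -np 2 --bind-to-core ./pingpong) to compare same-socket versus cross-socket latency.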