Jeff, sorry for the confusion - the all2all benchmark is a classic ping-pong that uses MPI_Send/MPI_Recv with 0-byte messages.
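In essence, the timing loop looks roughly like this (a simplified sketch of the idea, not the actual benchmark code):

/* 0-byte ping-pong between ranks 0 and 1; reports half the average
 * round-trip time as the one-way latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;
    const int iters = 100000;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("latency: %.3f us\n", (t1 - t0) / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}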
One thing I just noticed when using NetPIPE/MPI: Platform MPI gives almost constant latencies for small messages (~0.89us). I don't know how Platform MPI handles process binding - I just used the defaults. With Open MPI (regardless of core or socket binding) the results differ from run to run:

=== FIRST RUN ===
$ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
Using synchronous sends
1: n029
Using synchronous sends
0: n029
Now starting the main loop
0: 1 bytes 100000 times --> 4.66 Mbps in 1.64 usec
1: 2 bytes 100000 times --> 8.94 Mbps in 1.71 usec
2: 3 bytes 100000 times --> 13.65 Mbps in 1.68 usec
3: 4 bytes 100000 times --> 17.91 Mbps in 1.70 usec
4: 6 bytes 100000 times --> 29.04 Mbps in 1.58 usec
5: 8 bytes 100000 times --> 39.06 Mbps in 1.56 usec
6: 12 bytes 100000 times --> 57.58 Mbps in 1.59 usec

=== SECOND RUN (~3s after the previous run) ===
$ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
Using synchronous sends
1: n029
Using synchronous sends
0: n029
Now starting the main loop
0: 1 bytes 100000 times --> 5.73 Mbps in 1.33 usec
1: 2 bytes 100000 times --> 11.45 Mbps in 1.33 usec
2: 3 bytes 100000 times --> 17.13 Mbps in 1.34 usec
3: 4 bytes 100000 times --> 22.94 Mbps in 1.33 usec
4: 6 bytes 100000 times --> 34.39 Mbps in 1.33 usec
5: 8 bytes 100000 times --> 46.40 Mbps in 1.32 usec
6: 12 bytes 100000 times --> 68.92 Mbps in 1.33 usec

=== THIRD RUN ===
$ mpirun -np 2 --bind-to-socket ./NPmpi_ompi1.5.5 -S -u 12 -n 100000
Using synchronous sends
0: n029
Using synchronous sends
1: n029
Now starting the main loop
0: 1 bytes 100000 times --> 3.50 Mbps in 2.18 usec
1: 2 bytes 100000 times --> 6.99 Mbps in 2.18 usec
2: 3 bytes 100000 times --> 10.48 Mbps in 2.18 usec
3: 4 bytes 100000 times --> 14.00 Mbps in 2.18 usec
4: 6 bytes 100000 times --> 20.98 Mbps in 2.18 usec
5: 8 bytes 100000 times --> 27.84 Mbps in 2.19 usec
6: 12 bytes 100000 times --> 41.99 Mbps in 2.18 usec

At first I assumed that some CPU power-saving feature was enabled, but CPU frequency scaling is set to "performance" and there is only one available frequency (2.2GHz). Any idea how this can happen? (One thing I still want to check is whether the two ranks land on different cores or dies of the socket from run to run - see the small diagnostic sketch appended at the very end of this mail, after the quoted text.)

Matthias

On Wednesday 15 February 2012 19:29:38 Jeff Squyres wrote:
> Something is definitely wrong -- 1.4us is way too high for a 0 or 1 byte
> HRT ping pong. What is this all2all benchmark, btw? Is it measuring an
> MPI_ALLTOALL, or a pingpong?
>
> FWIW, on an older Nehalem machine running NetPIPE/MPI, I'm getting about
> .27us latencies for short messages over sm and binding to socket.
>
> On Feb 14, 2012, at 7:20 AM, Matthias Jurenz wrote:
> > I've built Open MPI 1.5.5rc1 (tarball from Web) with CFLAGS=-O3.
> > Unfortunately, also without any effect.
> >
> > Here are some results with binding reports enabled:
> >
> > $ mpirun *--bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
> > [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],1] to cpus 0002
> > [n043:61313] [[56788,0],0] odls:default:fork binding child [[56788,1],0] to cpus 0001
> > latency: 1.415us
> >
> > $ mpirun *-mca maffinity hwloc --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
> > [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],1] to cpus 0002
> > [n043:61469] [[49736,0],0] odls:default:fork binding child [[49736,1],0] to cpus 0001
> > latency: 1.4us
> >
> > $ mpirun *-mca maffinity first_use --bind-to-core* --report-bindings -np 2 ./all2all_ompi1.5.5
> > [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],1] to cpus 0002
> > [n043:61508] [[49681,0],0] odls:default:fork binding child [[49681,1],0] to cpus 0001
> > latency: 1.4us
> >
> > $ mpirun *--bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
> > [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],1] to socket 0 cpus 0001
> > [n043:61337] [[56780,0],0] odls:default:fork binding child [[56780,1],0] to socket 0 cpus 0001
> > latency: 4.0us
> >
> > $ mpirun *-mca maffinity hwloc --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
> > [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],1] to socket 0 cpus 0001
> > [n043:61615] [[49914,0],0] odls:default:fork binding child [[49914,1],0] to socket 0 cpus 0001
> > latency: 4.0us
> >
> > $ mpirun *-mca maffinity first_use --bind-to-socket* --report-bindings -np 2 ./all2all_ompi1.5.5
> > [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],1] to socket 0 cpus 0001
> > [n043:61639] [[49810,0],0] odls:default:fork binding child [[49810,1],0] to socket 0 cpus 0001
> > latency: 4.0us
> >
> > If socket binding is enabled, it seems that all ranks are bound to the very
> > first core of one and the same socket. Is this intended? I expected that
> > each rank would get its own socket (i.e. 2 ranks -> 2 sockets)...
> >
> > Matthias
> >
> > On Monday 13 February 2012 22:36:50 Jeff Squyres wrote:
> >> Also, double check that you have an optimized build, not a debugging
> >> build.
> >>
> >> SVN and HG checkouts default to debugging builds, which add in lots of
> >> latency.
> >>
> >> On Feb 13, 2012, at 10:22 AM, Ralph Castain wrote:
> >>> Few thoughts
> >>>
> >>> 1. Bind to socket is broken in 1.5.4 - fixed in next release
> >>>
> >>> 2. Add --report-bindings to cmd line and see where it thinks the procs
> >>> are bound
> >>>
> >>> 3. Sounds like memory may not be local - might be worth checking mem
> >>> binding.
> >>>
> >>> Sent from my iPad
> >>>
> >>> On Feb 13, 2012, at 7:07 AM, Matthias Jurenz
> >>> <matthias.jurenz@tu-dresden.de> wrote:
> >>>> Hi Sylvain,
> >>>>
> >>>> thanks for the quick response!
> >>>>
> >>>> Here are some results with process binding enabled. I hope I used the
> >>>> parameters correctly...
> >>>>
> >>>> bind two ranks to one socket:
> >>>> $ mpirun -np 2 --bind-to-core ./all2all
> >>>> $ mpirun -np 2 -mca mpi_paffinity_alone 1 ./all2all
> >>>>
> >>>> bind two ranks to two different sockets:
> >>>> $ mpirun -np 2 --bind-to-socket ./all2all
> >>>>
> >>>> All three runs resulted in similarly bad latencies (~1.4us).
> >>>>
> >>>> :-(
> >>>>
> >>>> Matthias
> >>>>
> >>>> On Monday 13 February 2012 12:43:22 sylvain.jeau...@bull.net wrote:
> >>>>> Hi Matthias,
> >>>>>
> >>>>> You might want to play with process binding to see if your problem is
> >>>>> related to bad memory affinity.
> >>>>>
> >>>>> Try to launch the pingpong on two CPUs of the same socket, then on
> >>>>> different sockets (i.e. bind each process to a core, and try
> >>>>> different configurations).
> >>>>>
> >>>>> Sylvain
> >>>>>
> >>>>> From: Matthias Jurenz <matthias.jur...@tu-dresden.de>
> >>>>> To: Open MPI Developers <de...@open-mpi.org>
> >>>>> Date: 13/02/2012 12:12
> >>>>> Subject: [OMPI devel] poor btl sm latency
> >>>>> Sent by: devel-boun...@open-mpi.org
> >>>>>
> >>>>> Hello all,
> >>>>>
> >>>>> on our new AMD cluster (AMD Opteron 6274, 2.2GHz) we get very bad
> >>>>> latencies (~1.5us) when performing 0-byte p2p communication on a
> >>>>> single node using the Open MPI sm BTL. When using Platform MPI we
> >>>>> get ~0.5us latencies, which is pretty good. The bandwidth results
> >>>>> are similar for both MPI implementations (~3.3GB/s) - this is okay.
> >>>>>
> >>>>> One node has 64 cores and 64GB RAM; it doesn't matter how many ranks
> >>>>> the application allocates - we get similar results with different
> >>>>> numbers of ranks.
> >>>>>
> >>>>> We are using Open MPI 1.5.4, built with gcc 4.3.4 without any special
> >>>>> configure options except the installation prefix and the location of
> >>>>> the LSF stuff.
> >>>>>
> >>>>> As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to
> >>>>> use /dev/shm instead of /tmp for the session directory, but it had no
> >>>>> effect. Furthermore, we tried the current release candidate 1.5.5rc1
> >>>>> of Open MPI, which provides an option to use SysV shared memory
> >>>>> (-mca shmem sysv) - this also results in similarly poor latencies.
> >>>>>
> >>>>> Do you have any idea? Please help!
> >>>>>
> >>>>> Thanks,
> >>>>> Matthias
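PS regarding the run-to-run variation above: here is a minimal sketch of the check I have in mind (my own quick hack, not part of NetPIPE or the all2all benchmark; sched_getcpu() is Linux/glibc-specific). It just prints the core each rank is actually running on, so one can see whether --bind-to-socket lets the two ranks end up on different cores or dies of the socket from run to run:

/* Print the core each MPI rank is currently running on. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Sample a few times: with socket binding the scheduler may still
     * migrate the process between the cores of that socket. */
    for (i = 0; i < 3; i++) {
        printf("rank %d is running on core %d\n", rank, sched_getcpu());
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}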