Re: [OMPI devel] poor btl sm latency

Matthias Jurenz Fri, 9 Mar 2012 09:50:36 -0500

I just made an interesting observation:

When binding the processes to two neighboring cores (L2 sharing), NetPIPE 
*sometimes* shows pretty good results: ~0.6us


$ mpirun -mca btl sm,self \
    -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 : \
    -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0
using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
adding 0x00000001 to 0x0
adding 0x00000001 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x00000001
adding 0x00000002 to 0x0
adding 0x00000002 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x00000002
Using no perturbations

0: n035
Using no perturbations

1: n035
Now starting the main loop
  0:       1 bytes 100000 times -->      6.01 Mbps in       1.27 usec
  1:       2 bytes 100000 times -->     12.04 Mbps in       1.27 usec
  2:       3 bytes 100000 times -->     18.07 Mbps in       1.27 usec
  3:       4 bytes 100000 times -->     24.13 Mbps in       1.26 usec

$ mpirun -mca btl sm,self \
    -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 : \
    -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0
using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
adding 0x00000001 to 0x0
adding 0x00000001 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x00000001
using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
adding 0x00000002 to 0x0
adding 0x00000002 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x00000002
Using no perturbations

0: n035
Using no perturbations

1: n035
Now starting the main loop
  0:       1 bytes 100000 times -->     12.96 Mbps in       0.59 usec
  1:       2 bytes 100000 times -->     25.78 Mbps in       0.59 usec
  2:       3 bytes 100000 times -->     38.62 Mbps in       0.59 usec
  3:       4 bytes 100000 times -->     52.88 Mbps in       0.58 usec

I can reproduce that approximately every tenth run.
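
To put a number on "approximately every tenth run", the NetPIPE output of many runs can be parsed and each run classified by its latency. Here is a minimal sketch in Python, assuming the result-line format shown above. The function names and the 0.9 usec threshold are illustrative choices of mine, not part of NetPIPE or Open MPI:

```python
import re
from statistics import median

# Matches NetPIPE result lines such as:
#   0:       1 bytes 100000 times -->      6.01 Mbps in       1.27 usec
LINE_RE = re.compile(
    r"^\s*\d+:\s+(\d+) bytes\s+(\d+) times -->\s+([\d.]+) Mbps in\s+([\d.]+) usec"
)


def parse_latencies(output):
    """Return the latency (usec) of every NetPIPE result line in `output`."""
    latencies = []
    for line in output.splitlines():
        m = LINE_RE.match(line)
        if m:
            latencies.append(float(m.group(4)))
    return latencies


def is_fast_run(output, threshold_usec=0.9):
    """Classify a run as 'fast' if its median latency is below the threshold.

    The threshold is a hypothetical cut between the two modes seen above
    (~0.59 usec and ~1.27 usec); it raises StatisticsError if no result
    lines were found.
    """
    return median(parse_latencies(output)) < threshold_usec
```

Counting how many of N captured outputs satisfy `is_fast_run` then gives the fraction of fast runs directly.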

When binding the processes to cores with exclusive L2 caches (e.g. cores 0 
and 2), I get constant latencies of ~1.1us.
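
To compare the shared-L2 and exclusive-L2 bindings systematically, the command line can be generated per core pair. This is only a sketch: it reuses exactly the options from the mpirun call above, and the helper name `netpipe_cmd` is made up for illustration:

```python
import shlex


def netpipe_cmd(core_a, core_b, binary="./NPmpi_ompi1.5.5"):
    """Build the two-process mpirun command line, one hwloc-bind per rank."""
    def rank(core):
        # Per-rank part: bind to one core, then run NetPIPE as in the mail.
        return ["-np", "1", "hwloc-bind", "-v", f"core:{core}",
                binary, "-u", "4", "-n", "100000", "-p", "0"]

    return (["mpirun", "-mca", "btl", "sm,self"]
            + rank(core_a) + [":"] + rank(core_b))


# Shared L2 (cores 0 and 1) versus exclusive L2 (cores 0 and 2) on this machine.
for pair in [(0, 1), (0, 2)]:
    print(shlex.join(netpipe_cmd(*pair)))
```

The returned list can be passed as-is to `subprocess.run(..., capture_output=True, text=True)` to collect the output of repeated runs.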

Matthias

On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> Here are the SM BTL parameters:
> 
> $ ompi_info --param btl sm
> MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> default value) Verbosity level of the BTL framework
> MCA btl: parameter "btl" (current value: <self,sm,openib>, data source:
> file
> [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> Default selection set of components for the btl framework (<none> means
> use all components that can be found)
> MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source:
> default value) Whether this component supports the knem Linux kernel module
> or not
> MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> default value) Whether knem support is desired or not (negative = try to
> enable knem support, but continue even if it is not available, 0 = do not
> enable knem support, positive = try to enable knem support and fail if it
> is not available)
> MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data source:
> default value) Minimum message size (in bytes) to use the knem DMA mode;
> ignored if knem does not support DMA mode (0 = do not use the knem DMA
> mode)
> MCA btl: parameter "btl_sm_knem_max_simultaneous" (current value: <0>,
> data source: default value) Max number of simultaneous ongoing knem
> operations to support (0 = do everything synchronously, which probably
> gives the best large message latency; >0 means to do all operations
> asynchronously, which supports better overlap for simultaneous large
> message sends)
> MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data source:
> default value)
> MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data
> source: default value)
> MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data
> source: default value)
> MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data source:
> default value)
> MCA btl: parameter "btl_sm_mpool" (current value: <sm>, data source:
> default value)
> MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data source:
> default value)
> MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source:
> default value)
> MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data
> source: default value)
> MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data
> source: default value)
> MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data
> source: default value) BTL exclusivity (must be >= 0)
> MCA btl: parameter "btl_sm_flags" (current value: <5>, data source: default
> value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8,
> RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr" PML
> (ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128; flags only
> used by the "bfo" PML (ignored by others): FAILOVER_SUPPORT=512)
> MCA btl: parameter "btl_sm_rndv_eager_limit" (current value: <4096>, data
> source: default value) Size (in bytes) of "phase 1" fragment sent for all
> large messages (must be >= 0 and <= eager_limit)
> MCA btl: parameter "btl_sm_eager_limit" (current value: <4096>, data
> source: default value) Maximum size (in bytes) of "short" messages (must
> be >= 1).
> MCA btl: parameter "btl_sm_max_send_size" (current value: <32768>, data
> source: default value) Maximum size (in bytes) of a single "phase 2"
> fragment of a long message when using the pipeline protocol (must be >= 1)
> MCA btl: parameter "btl_sm_bandwidth" (current value: <9000>, data source:
> default value) Approximate maximum bandwidth of interconnect(0 =
> auto-detect value at run-time [not supported in all BTL modules], >= 1 =
> bandwidth in Mbps)
> MCA btl: parameter "btl_sm_latency" (current value: <1>, data source:
> default value) Approximate latency of interconnect (must be >= 0)
> MCA btl: parameter "btl_sm_priority" (current value: <0>, data source:
> default value)
> MCA btl: parameter "btl_base_warn_component_unused" (current value: <1>,
> data source: default value) This parameter is used to turn on warning
> messages when certain NICs are not used
> 
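
A listing like the one quoted above is easier to compare between builds once it is reduced to name/value pairs. Here is a small sketch in Python, assuming the `parameter "..." (current value: <...>` layout shown in the listing; the helper name `parse_mca_params` is hypothetical and not part of Open MPI:

```python
import re

PARAM_RE = re.compile(r'parameter "([^"]+)" \(current value: <([^>]*)>')


def parse_mca_params(text):
    """Map each MCA parameter name to its current value (as a string)."""
    # Drop mail quote prefixes ("> ") and leading indentation on every line.
    unquoted = re.sub(r"^[> ]+", "", text, flags=re.MULTILINE)
    # Undo line wrapping so a name and its value end up on the same line.
    flat = " ".join(unquoted.split())
    return dict(PARAM_RE.findall(flat))
```

Two such dictionaries (for example from 1.4.5 and 1.5.5) can then be diffed key by key to spot parameters whose values differ.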
> Matthias
> 
> On Friday 02 March 2012 16:23:32 George Bosilca wrote:
> > Please run "ompi_info --param btl sm" in your environment. The lazy_free
> > parameter directs the internals of the SM BTL not to release the memory
> > fragments used to communicate until the lazy limit is reached. The
> > default value was deemed reasonable a while back when the default number
> > of fragments was large. Lately there were some patches to reduce the
> > memory footprint of the SM BTL, and these might have lowered the number
> > of available fragments to a point where the default value for lazy_free
> > is now too large.
> > 
> >   george.
> > 
> > On Mar 2, 2012, at 10:08 , Matthias Jurenz wrote:
> > > Thanks to the OTPO tool, I figured out that setting the MCA
> > > parameter btl_sm_fifo_lazy_free to 1 (default is 120) improves the
> > > latency significantly: 0.88us
> > > 
> > > But somehow I get the feeling that this doesn't eliminate the actual
> > > problem...
> > > 
> > > Matthias
> > > 
> > > On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote:
> > >> On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
> > >>> Ok.  Good that there's no oversubscription bug, at least.  :-)
> > >>> 
> > >>> Did you see my off-list mail to you yesterday about building with an
> > >>> external copy of hwloc 1.4 to see if that helps?
> > >> 
> > > >> Yes, I did - I answered as well. Our mail server seems to be somewhat
> > > >> busy today...
> > >> 
> > >> Just for the record: Using hwloc-1.4 makes no difference.
> > >> 
> > >> Matthias
> > >> 
> > >>> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> > >>>> To exclude a possible bug within the LSF component, I rebuilt Open
> > >>>> MPI without support for LSF (--without-lsf).
> > >>>> 
> > >>>> -> It makes no difference - the latency is still bad: ~1.1us.
> > >>>> 
> > >>>> Matthias
> > >>>> 
> > >>>> On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> > > >>>>> SORRY, it was obviously a big mistake on my part. :-(
> > >>>>> 
> > > >>>>> Open MPI 1.5.5 was built with LSF support, so when starting an LSF
> > > >>>>> job it's necessary to request at least as many tasks/cores as are
> > > >>>>> used by the subsequent mpirun command. That was not the case - I
> > > >>>>> forgot bsub's '-n' option to specify the number of tasks, so
> > > >>>>> only *one* task/core was requested.
> > >>>>> 
> > >>>>> Open MPI 1.4.5 was built *without* LSF support, so the supposed
> > >>>>> misbehavior could not happen with it.
> > >>>>> 
> > > >>>>> In short, there is no bug in Open MPI 1.5.x regarding the
> > > >>>>> detection of oversubscription. Sorry for any confusion!
> > >>>>> 
> > >>>>> Matthias
> > >>>>> 
> > >>>>> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> > >>>>>> When using Open MPI v1.4.5 I get ~1.1us. That's the same result as
> > >>>>>> I get with Open MPI v1.5.x using mpi_yield_when_idle=0.
> > > >>>>>> So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2)
> > > >>>>>> regarding the automatic performance mode selection.
> > >>>>>> 
> > >>>>>> When enabling the degraded performance mode for Open MPI 1.4.5
> > >>>>>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
> > >>>>>> 
> > >>>>>> Matthias
> > >>>>>> 
> > >>>>>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> > >>>>>>> On 13/02/12 22:11, Matthias Jurenz wrote:
> > >>>>>>>> Do you have any idea? Please help!
> > >>>>>>> 
> > >>>>>>> Do you see the same bad latency in the old branch (1.4.5) ?
> > >>>>>>> 
> > >>>>>>> cheers,
> > >>>>>>> Chris
> > >>>>>> 
> > >>>>>> _______________________________________________
> > >>>>>> devel mailing list
> > >>>>>> de...@open-mpi.org
> > >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >>>>> 
> > >>>>> _______________________________________________
> > >>>>> devel mailing list
> > >>>>> de...@open-mpi.org
> > >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >>>> 
> > >>>> _______________________________________________
> > >>>> devel mailing list
> > >>>> de...@open-mpi.org
> > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >> 
> > >> _______________________________________________
> > >> devel mailing list
> > >> de...@open-mpi.org
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > 
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
