It's SUSE Linux Enterprise Server 11 Service Pack 1 with kernel version
2.6.32.49-0.3-default.

Matthias
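(For the record, this is the sort of thing that can be read straight off
the node; /etc/SuSE-release is the usual place on SLES 11, though the file
name varies across releases:)

$ uname -r              # kernel release, e.g. 2.6.32.49-0.3-default
$ cat /etc/SuSE-release # distribution and service pack level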
On Friday 09 March 2012 16:36:41 you wrote:
> What OS are you using?
>
> Joshua
>
> ----- Original Message -----
> From: Matthias Jurenz [mailto:matthias.jur...@tu-dresden.de]
> Sent: Friday, March 09, 2012 08:50 AM
> To: Open MPI Developers <de...@open-mpi.org>
> Cc: Mora, Joshua
> Subject: Re: [OMPI devel] poor btl sm latency
>
> I just made an interesting observation:
>
> When binding the processes to two neighboring cores (L2 sharing), NetPIPE
> *sometimes* shows pretty good results: ~0.5us
>
> $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0
> using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
> using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
> adding 0x00000001 to 0x0
> adding 0x00000001 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x00000001
> adding 0x00000002 to 0x0
> adding 0x00000002 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x00000002
> Using no perturbations
>
> 0: n035
> Using no perturbations
>
> 1: n035
> Now starting the main loop
> 0: 1 bytes 100000 times --> 6.01 Mbps in 1.27 usec
> 1: 2 bytes 100000 times --> 12.04 Mbps in 1.27 usec
> 2: 3 bytes 100000 times --> 18.07 Mbps in 1.27 usec
> 3: 4 bytes 100000 times --> 24.13 Mbps in 1.26 usec
>
> $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0
> using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
> adding 0x00000001 to 0x0
> adding 0x00000001 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x00000001
> using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
> adding 0x00000002 to 0x0
> adding 0x00000002 to 0x0
> assuming the command starts at ./NPmpi_ompi1.5.5
> binding on cpu set 0x00000002
> Using no perturbations
>
> 0: n035
> Using no perturbations
>
> 1: n035
> Now starting the main loop
> 0: 1 bytes 100000 times --> 12.96 Mbps in 0.59 usec
> 1: 2 bytes 100000 times --> 25.78 Mbps in 0.59 usec
> 2: 3 bytes 100000 times --> 38.62 Mbps in 0.59 usec
> 3: 4 bytes 100000 times --> 52.88 Mbps in 0.58 usec
>
> I can reproduce that approximately every tenth run.
>
> When binding the processes to cores with exclusive L2 caches (e.g. core 0
> and 2) I get constant latencies of ~1.1us.
>
> Matthias
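(Which cores actually share an L2 on a given node can be checked with the
lstopo tool from hwloc, the same package that provides the hwloc-bind used
above; cores listed under the same L2 object in the output share that
cache. For example:)

$ lstopo --no-io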
> On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> > Here are the SM BTL parameters:
> >
> > $ ompi_info --param btl sm
> > MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> >   default value) Verbosity level of the BTL framework
> > MCA btl: parameter "btl" (current value: <self,sm,openib>, data source:
> >   file [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> >   Default selection set of components for the btl framework (<none> means
> >   use all components that can be found)
> > MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source:
> >   default value) Whether this component supports the knem Linux kernel
> >   module or not
> > MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> >   default value) Whether knem support is desired or not (negative = try to
> >   enable knem support, but continue even if it is not available, 0 = do not
> >   enable knem support, positive = try to enable knem support and fail if it
> >   is not available)
> > MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data source:
> >   default value) Minimum message size (in bytes) to use the knem DMA mode;
> >   ignored if knem does not support DMA mode (0 = do not use the knem DMA
> >   mode)
> > MCA btl: parameter "btl_sm_knem_max_simultaneous" (current value: <0>,
> >   data source: default value) Max number of simultaneous ongoing knem
> >   operations to support (0 = do everything synchronously, which probably
> >   gives the best large message latency; >0 means to do all operations
> >   asynchronously, which supports better overlap for simultaneous large
> >   message sends)
> > MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data
> >   source: default value)
> > MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data
> >   source: default value)
> > MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data
> >   source: default value)
> > MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data source:
> >   default value)
> > MCA btl: parameter "btl_sm_mpool" (current value: <sm>, data source:
> >   default value)
> > MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data
> >   source: default value)
> > MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source:
> >   default value)
> > MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data
> >   source: default value)
> > MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data
> >   source: default value)
> > MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data
> >   source: default value) BTL exclusivity (must be >= 0)
> > MCA btl: parameter "btl_sm_flags" (current value: <5>, data source:
> >   default value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
> >   SEND_INPLACE=8, RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used
> >   by the "dr" PML (ignored by others): ACK=16, CHECKSUM=32,
> >   RDMA_COMPLETION=128; flags only used by the "bfo" PML (ignored by
> >   others): FAILOVER_SUPPORT=512)
> > MCA btl: parameter "btl_sm_rndv_eager_limit" (current value: <4096>, data
> >   source: default value) Size (in bytes) of "phase 1" fragment sent for
> >   all large messages (must be >= 0 and <= eager_limit)
> > MCA btl: parameter "btl_sm_eager_limit" (current value: <4096>, data
> >   source: default value) Maximum size (in bytes) of "short" messages
> >   (must be >= 1)
> > MCA btl: parameter "btl_sm_max_send_size" (current value: <32768>, data
> >   source: default value) Maximum size (in bytes) of a single "phase 2"
> >   fragment of a long message when using the pipeline protocol (must be >= 1)
> > MCA btl: parameter "btl_sm_bandwidth" (current value: <9000>, data
> >   source: default value) Approximate maximum bandwidth of interconnect
> >   (0 = auto-detect value at run-time [not supported in all BTL modules],
> >   >= 1 = bandwidth in Mbps)
> > MCA btl: parameter "btl_sm_latency" (current value: <1>, data source:
> >   default value) Approximate latency of interconnect (must be >= 0)
> > MCA btl: parameter "btl_sm_priority" (current value: <0>, data source:
> >   default value)
> > MCA btl: parameter "btl_base_warn_component_unused" (current value: <1>,
> >   data source: default value) This parameter is used to turn on warning
> >   messages when certain NICs are not used
> >
> > Matthias
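(The "data source: file" entry above is the site-wide MCA config; it pins a
parameter with one "name = value" line per parameter. Judging from the dump,
roughly what that file contains here:)

$ cat /sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf
btl = self,sm,openib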
MCA btl: parameter "btl_sm_max_send_size" (current value: > > <32768>, data source: default value) Maximum size (in bytes) of a single > > "phase 2" fragment of a long message when using the pipeline protocol > > (must be >= 1) > > MCA btl: parameter "btl_sm_bandwidth" (current value: <9000>, data > > source: default value) Approximate maximum bandwidth of interconnect(0 = > > auto-detect value at run-time [not supported in all BTL modules], >= 1 = > > bandwidth in Mbps) > > MCA btl: parameter "btl_sm_latency" (current value: <1>, data source: > > default value) Approximate latency of interconnect (must be >= 0) > > MCA btl: parameter "btl_sm_priority" (current value: <0>, data source: > > default value) > > MCA btl: parameter "btl_base_warn_component_unused" (current value: <1>, > > data source: default value) This parameter is used to turn on warning > > messages when certain NICs are not used > > > > Matthias > > > > On Friday 02 March 2012 16:23:32 George Bosilca wrote: > > > Please do a "ompi_info --param btl sm" on your environment. The > > > lazy_free direct the internals of the SM BTL not to release the memory > > > fragments used to communicate until the lazy limit is reached. The > > > default value was deemed as reasonable a while back when the number of > > > default fragments was large. Lately there were some patches to reduce > > > the memory footprint of the SM BTL and these might have lowered the > > > available fragments to a limit where the default value for the > > > lazy_free is now too large. > > > > > > george. > > > > > > On Mar 2, 2012, at 10:08 , Matthias Jurenz wrote: > > > > In thanks to the OTPO tool, I figured out that setting the MCA > > > > parameter btl_sm_fifo_lazy_free to 1 (default is 120) improves the > > > > latency significantly: 0,88µs > > > > > > > > But somehow I get the feeling that this doesn't eliminate the actual > > > > problem... > > > > > > > > Matthias > > > > > > > > On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote: > > > >> On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote: > > > >>> Ok. Good that there's no oversubscription bug, at least. :-) > > > >>> > > > >>> Did you see my off-list mail to you yesterday about building with > > > >>> an external copy of hwloc 1.4 to see if that helps? > > > >> > > > >> Yes, I did - I answered as well. Our mail server seems to be > > > >> something busy today... > > > >> > > > >> Just for the record: Using hwloc-1.4 makes no difference. > > > >> > > > >> Matthias > > > >> > > > >>> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote: > > > >>>> To exclude a possible bug within the LSF component, I rebuilt Open > > > >>>> MPI without support for LSF (--without-lsf). > > > >>>> > > > >>>> -> It makes no difference - the latency is still bad: ~1.1us. > > > >>>> > > > >>>> Matthias > > > >>>> > > > >>>> On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote: > > > >>>>> SORRY, it was obviously a big mistake by me. :-( > > > >>>>> > > > >>>>> Open MPI 1.5.5 was built with LSF support, so when starting an > > > >>>>> LSF job it's necessary to request at least the number of > > > >>>>> tasks/cores as used for the subsequent mpirun command. That was > > > >>>>> not the case - I forgot the bsub's '-n' option to specify the > > > >>>>> number of task, so only *one* task/core was requested. > > > >>>>> > > > >>>>> Open MPI 1.4.5 was built *without* LSF support, so the supposed > > > >>>>> misbehavior could not happen with it. 
> > > > On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote:
> > > >> On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
> > > >>> Ok. Good that there's no oversubscription bug, at least. :-)
> > > >>>
> > > >>> Did you see my off-list mail to you yesterday about building with
> > > >>> an external copy of hwloc 1.4 to see if that helps?
> > > >>
> > > >> Yes, I did - I answered as well. Our mail server seems to be
> > > >> somewhat busy today...
> > > >>
> > > >> Just for the record: using hwloc 1.4 makes no difference.
> > > >>
> > > >> Matthias
> > > >>
> > > >>> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> > > >>>> To exclude a possible bug within the LSF component, I rebuilt Open
> > > >>>> MPI without support for LSF (--without-lsf).
> > > >>>>
> > > >>>> -> It makes no difference - the latency is still bad: ~1.1us.
> > > >>>>
> > > >>>> Matthias
> > > >>>>
> > > >>>> On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> > > >>>>> SORRY, it was obviously a big mistake by me. :-(
> > > >>>>>
> > > >>>>> Open MPI 1.5.5 was built with LSF support, so when starting an
> > > >>>>> LSF job it's necessary to request at least as many tasks/cores
> > > >>>>> as used for the subsequent mpirun command. That was not the
> > > >>>>> case - I forgot bsub's '-n' option to specify the number of
> > > >>>>> tasks, so only *one* task/core was requested.
> > > >>>>>
> > > >>>>> Open MPI 1.4.5 was built *without* LSF support, so the supposed
> > > >>>>> misbehavior could not happen with it.
> > > >>>>>
> > > >>>>> In short, there is no bug in Open MPI 1.5.x regarding the
> > > >>>>> detection of oversubscription. Sorry for any confusion!
> > > >>>>>
> > > >>>>> Matthias
> > > >>>>>
> > > >>>>> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> > > >>>>>> When using Open MPI v1.4.5 I get ~1.1us. That's the same result
> > > >>>>>> as I get with Open MPI v1.5.x using mpi_yield_when_idle=0. So I
> > > >>>>>> think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2)
> > > >>>>>> regarding the automatic performance mode selection.
> > > >>>>>>
> > > >>>>>> When enabling the degraded performance mode for Open MPI 1.4.5
> > > >>>>>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
> > > >>>>>>
> > > >>>>>> Matthias
> > > >>>>>>
> > > >>>>>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> > > >>>>>>> On 13/02/12 22:11, Matthias Jurenz wrote:
> > > >>>>>>>> Do you have any idea? Please help!
> > > >>>>>>>
> > > >>>>>>> Do you see the same bad latency in the old branch (1.4.5)?
> > > >>>>>>>
> > > >>>>>>> cheers,
> > > >>>>>>> Chris
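(Regarding the forgotten bsub '-n' above: under LSF the slot request has to
cover every rank that mpirun will start, otherwise Open MPI treats the job
as oversubscribed and switches into the degraded, yield-when-idle mode that
inflates latency. A sketch, with the job script name assumed:)

$ bsub -n 2 ./run_netpipe_job.sh   # request two slots for the two NetPIPE ranks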