I just made an interesting observation: when binding the processes to two neighboring cores (sharing an L2 cache), NetPIPE *sometimes* shows pretty good results: ~0.5us

$ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0
using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
adding 0x00000001 to 0x0
adding 0x00000001 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x00000001
adding 0x00000002 to 0x0
adding 0x00000002 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x00000002
Using no perturbations
0: n035
Using no perturbations
1: n035
Now starting the main loop
0: 1 bytes 100000 times --> 6.01 Mbps in 1.27 usec
1: 2 bytes 100000 times --> 12.04 Mbps in 1.27 usec
2: 3 bytes 100000 times --> 18.07 Mbps in 1.27 usec
3: 4 bytes 100000 times --> 24.13 Mbps in 1.26 usec

$ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0
using object #0 depth 6 below cpuset 0xffffffff,0xffffffff
adding 0x00000001 to 0x0
adding 0x00000001 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x00000001
using object #1 depth 6 below cpuset 0xffffffff,0xffffffff
adding 0x00000002 to 0x0
adding 0x00000002 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x00000002
Using no perturbations
0: n035
Using no perturbations
1: n035
Now starting the main loop
0: 1 bytes 100000 times --> 12.96 Mbps in 0.59 usec
1: 2 bytes 100000 times --> 25.78 Mbps in 0.59 usec
2: 3 bytes 100000 times --> 38.62 Mbps in 0.59 usec
3: 4 bytes 100000 times --> 52.88 Mbps in 0.58 usec

I can reproduce that approximately every tenth run.

When binding the processes for exclusive L2 caches (e.g.
core 0 and 2) I get constant latencies ~1.1us

Matthias

On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> Here are the SM BTL parameters:
>
> $ ompi_info --param btl sm
> MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> default value) Verbosity level of the BTL framework
> MCA btl: parameter "btl" (current value: <self,sm,openib>, data source:
> file
> [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> Default selection set of components for the btl framework (<none> means
> use all components that can be found)
> MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source:
> default value) Whether this component supports the knem Linux kernel module
> or not
> MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> default value) Whether knem support is desired or not (negative = try to
> enable knem support, but continue even if it is not available, 0 = do not
> enable knem support, positive = try to enable knem support and fail if it
> is not available)
> MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data source:
> default value) Minimum message size (in bytes) to use the knem DMA mode;
> ignored if knem does not support DMA mode (0 = do not use the knem DMA
> mode)
> MCA btl: parameter "btl_sm_knem_max_simultaneous" (current value: <0>,
> data source: default value) Max number of simultaneous ongoing knem
> operations to support (0 = do everything synchronously, which probably
> gives the best large message latency; >0 means to do all operations
> asynchronously, which supports better overlap for simultaneous large
> message sends)
> MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data source:
> default value)
> MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data
> source: default value)
> MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data
> source: default value)
> MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data source:
> default value)
> MCA btl: parameter "btl_sm_mpool" (current value: <sm>, data source:
> default value)
> MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data source:
> default value)
> MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source:
> default value)
> MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data
> source: default value)
> MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data
> source: default value)
> MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data
> source: default value) BTL exclusivity (must be >= 0)
> MCA btl: parameter "btl_sm_flags" (current value: <5>, data source: default
> value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8,
> RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr" PML
> (ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128; flags only
> used by the "bfo" PML (ignored by others): FAILOVER_SUPPORT=512)
> MCA btl: parameter "btl_sm_rndv_eager_limit" (current value: <4096>, data
> source: default value) Size (in bytes) of "phase 1" fragment sent for all
> large messages (must be >= 0 and <= eager_limit)
> MCA btl: parameter "btl_sm_eager_limit" (current value: <4096>, data
> source: default value) Maximum size (in bytes) of "short" messages (must
> be >= 1).
> MCA btl: parameter "btl_sm_max_send_size" (current value: <32768>, data
> source: default value) Maximum size (in bytes) of a single
> "phase 2" fragment of a long message when using the pipeline protocol
> (must be >= 1)
> MCA btl: parameter "btl_sm_bandwidth" (current value: <9000>, data source:
> default value) Approximate maximum bandwidth of interconnect(0 =
> auto-detect value at run-time [not supported in all BTL modules], >= 1 =
> bandwidth in Mbps)
> MCA btl: parameter "btl_sm_latency" (current value: <1>, data source:
> default value) Approximate latency of interconnect (must be >= 0)
> MCA btl: parameter "btl_sm_priority" (current value: <0>, data source:
> default value)
> MCA btl: parameter "btl_base_warn_component_unused" (current value: <1>,
> data source: default value) This parameter is used to turn on warning
> messages when certain NICs are not used
>
> Matthias
>
> On Friday 02 March 2012 16:23:32 George Bosilca wrote:
> > Please do a "ompi_info --param btl sm" on your environment. The lazy_free
> > directs the internals of the SM BTL not to release the memory fragments
> > used to communicate until the lazy limit is reached. The default value
> > was deemed reasonable a while back when the number of default
> > fragments was large. Lately there were some patches to reduce the memory
> > footprint of the SM BTL and these might have lowered the available
> > fragments to a limit where the default value for the lazy_free is now
> > too large.
> >
> > george.
> >
> > On Mar 2, 2012, at 10:08 , Matthias Jurenz wrote:
> > > Thanks to the OTPO tool, I figured out that setting the MCA
> > > parameter btl_sm_fifo_lazy_free to 1 (default is 120) improves the
> > > latency significantly: 0.88µs
> > >
> > > But somehow I get the feeling that this doesn't eliminate the actual
> > > problem...
> > >
> > > Matthias
> > >
> > > On Friday 02 March 2012 15:37:03 Matthias Jurenz wrote:
> > >> On Friday 02 March 2012 14:58:45 Jeffrey Squyres wrote:
> > >>> Ok. Good that there's no oversubscription bug, at least. :-)
> > >>>
> > >>> Did you see my off-list mail to you yesterday about building with an
> > >>> external copy of hwloc 1.4 to see if that helps?
> > >>
> > >> Yes, I did - I answered as well. Our mail server seems to be somewhat
> > >> busy today...
> > >>
> > >> Just for the record: Using hwloc-1.4 makes no difference.
> > >>
> > >> Matthias
> > >>
> > >>> On Mar 2, 2012, at 8:26 AM, Matthias Jurenz wrote:
> > >>>> To exclude a possible bug within the LSF component, I rebuilt Open
> > >>>> MPI without support for LSF (--without-lsf).
> > >>>>
> > >>>> -> It makes no difference - the latency is still bad: ~1.1us.
> > >>>>
> > >>>> Matthias
> > >>>>
> > >>>> On Friday 02 March 2012 13:50:13 Matthias Jurenz wrote:
> > >>>>> SORRY, it was obviously a big mistake by me. :-(
> > >>>>>
> > >>>>> Open MPI 1.5.5 was built with LSF support, so when starting an LSF
> > >>>>> job it's necessary to request at least the number of tasks/cores
> > >>>>> used for the subsequent mpirun command. That was not the case - I
> > >>>>> forgot bsub's '-n' option to specify the number of tasks, so
> > >>>>> only *one* task/core was requested.
> > >>>>>
> > >>>>> Open MPI 1.4.5 was built *without* LSF support, so the supposed
> > >>>>> misbehavior could not happen with it.
> > >>>>>
> > >>>>> In short, there is no bug in Open MPI 1.5.x regarding the
> > >>>>> detection of oversubscription. Sorry for any confusion!
> > >>>>>
> > >>>>> Matthias
> > >>>>>
> > >>>>> On Tuesday 28 February 2012 13:36:56 Matthias Jurenz wrote:
> > >>>>>> When using Open MPI v1.4.5 I get ~1.1us. That's the same result as
> > >>>>>> I get with Open MPI v1.5.x using mpi_yield_when_idle=0.
> > >>>>>> So I think there is a bug in Open MPI (v1.5.4 and v1.5.5rc2)
> > >>>>>> regarding the automatic performance mode selection.
> > >>>>>>
> > >>>>>> When enabling the degraded performance mode for Open MPI 1.4.5
> > >>>>>> (mpi_yield_when_idle=1) I get ~1.8us latencies.
> > >>>>>>
> > >>>>>> Matthias
> > >>>>>>
> > >>>>>> On Tuesday 28 February 2012 06:20:28 Christopher Samuel wrote:
> > >>>>>>> On 13/02/12 22:11, Matthias Jurenz wrote:
> > >>>>>>>> Do you have any idea? Please help!
> > >>>>>>>
> > >>>>>>> Do you see the same bad latency in the old branch (1.4.5)?
> > >>>>>>>
> > >>>>>>> cheers,
> > >>>>>>> Chris
> > >>>>>>
> > >>>>>> _______________________________________________
> > >>>>>> devel mailing list
> > >>>>>> de...@open-mpi.org
> > >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
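
For anyone who wants to check the "approximately every tenth run" observation at the top of this mail, a small helper along the following lines could repeat the benchmark and print the 1-byte latency of each run. This is only a sketch: the function names one_byte_latency and repeat_netpipe are made up for illustration, the default of 10 runs is arbitrary, and the parsing assumes NetPIPE rows look like the ones quoted above ("0: 1 bytes 100000 times --> 6.01 Mbps in 1.27 usec"). The mpirun command line is the one from this mail, minus hwloc-bind's -v.

```shell
#!/bin/sh
# Print the latency (in usec) of the first 1-byte row found in NetPIPE
# output read from stdin. Assumes rows of the form
#   "0: 1 bytes 100000 times --> 6.01 Mbps in 1.27 usec"
# (hypothetical helper, not part of NetPIPE or Open MPI).
one_byte_latency() {
    awk '!seen && $2 == 1 && $3 == "bytes" && $NF == "usec" {
             print $(NF-1)   # second-to-last field is the time in usec
             seen = 1        # keep reading so the benchmark is not cut off
         }'
}

# Hypothetical driver: run the benchmark N times (default 10) on the two
# L2-sharing cores and log one latency value per run.
repeat_netpipe() {
    runs=${1:-10}
    i=1
    while [ "$i" -le "$runs" ]; do
        lat=$(mpirun -mca btl sm,self \
                -np 1 hwloc-bind core:0 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 : \
                -np 1 hwloc-bind core:1 ./NPmpi_ompi1.5.5 -u 4 -n 100000 -p 0 \
              | one_byte_latency)
        echo "run $i: $lat usec"
        i=$((i + 1))
    done
}
```

After sourcing the file, "repeat_netpipe 20" would print twenty lines, which should make it easy to see how often the ~0.59 usec case shows up compared to the ~1.27 usec case.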