It seems that all my confusion has already been cleared up, thanks Jeff!

Thanks!
Yanfei

Sent from my iPad

> On Mar 27, 2014, at 8:38 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> 
> wrote:
> 
> Here's a few key facts that might help:
> 
> 1. The hostfile has nothing to do with what network interfaces are used for 
> MPI traffic.  It is only used to specify what servers you launch on, 
> regardless of what IP interface on that server you specify.
> 2. Which network interfaces are used is determined by a combination of the BTL 
> selected and any optional additional parameters given to that BTL.
> 3. If you do not specify any BTLs, then Open MPI will choose the "best" ones 
> and use those.
> 4. As of somewhere in the v1.7.x series, the ompi_info command only shows a 
> few MCA parameters by default.  To see all MCA parameters, add "--level 9" to 
> the command line.
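> 
> For example, a minimal sketch of points 2-4 using commands already shown in this 
> thread: restrict the BTL pool on the mpirun command line, and list every BTL MCA 
> parameter at the higher verbosity level:
> 
>    mpirun --hostfile hosts -np 2 --map-by node --mca btl tcp,sm,self osu_latency
>    ompi_info --param btl all --level 9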
> 
> In your case, if you didn't specify a BTL, Open MPI would see your RoCE 
> interfaces and therefore choose the openib BTL for off-node communication 
> (and exclude the TCP BTL, because it is "worse" than the openib BTL), sm for 
> on-node communication, and self for loopback communication.
> 
> If you specify --mca btl tcp,sm,self, then you are restricting OMPI's pool of 
> BTLs that it can choose from -- meaning that the openib BTL won't even be 
> considered.  So OMPI will therefore use the TCP BTL for off-node 
> communication.
> 
> Also, remember that you can "mpirun ... hostname" (i.e., the Linux "hostname" 
> command) to verify what servers you are actually running on.
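> 
> A minimal sketch of both checks, reusing the hostfile and benchmark from this 
> thread (btl_base_verbose is an assumption here, as the usual MCA parameter for 
> making the BTL framework log what it selects):
> 
>    # verify which servers the procs actually land on
>    mpirun --hostfile hosts -np 2 --map-by node hostname
> 
>    # log which BTLs are considered/selected for each peer
>    mpirun --hostfile hosts -np 2 --map-by node --mca btl_base_verbose 30 osu_latency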
> 
> I see that the ompi_info(1) man page is not super-detailed about the --level 
> option; I'll go fix that right now (and ensure it's in the v1.8 release).
> 
> 
> 
>> On Mar 27, 2014, at 6:44 AM, "Wang,Yanfei(SYS)" <wangyanfe...@baidu.com> 
>> wrote:
>> 
>> Hi, 
>> 
>> Update: 
>> If I explicitly assign --mca btl tcp,sm,self, the traffic goes over the 10G 
>> TCP/IP link instead of the 40G RDMA link, and the TCP/IP latency is about 22us 
>> on average, which is reasonable.
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node --mca 
>> btl tcp,sm,self osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                      22.07
>> 1                      22.48
>> 2                      22.38
>> 4                      22.39
>> 8                      22.52
>> 16                     22.52
>> 32                     22.59
>> 64                     22.73
>> 128                    23.01
>> 256                    24.32
>> 512                    28.50
>> 1024                   31.06
>> 2048                   56.06
>> 4096                   68.53
>> 8192                   77.09
>> 16384                 105.23
>> 32768                 143.51
>> 65536                 229.79
>> 131072                285.28
>> 262144                423.26
>> 524288                693.82
>> 1048576              1634.03
>> 2097152              3311.69
>> 4194304              7055.16
>> 
>> The conclusion is that "--hostfile with the 10G IP addresses" does let the 
>> traffic go over the 10G TCP/IP link (once the tcp BTL is selected), and that 
>> mpirun selects the RDMA link by default even if you did not enable 
>> "--mca btl openib,sm,self"!
>> So, how should I understand that "--hostfile" alone does not control this, and 
>> how can I control the multi-HCA (NIC) traffic for the MPI library?
>> 
>> Besides, the following command does not show any parameters for controlling the 
>> RDMA transport; it only shows the TCP parameters.
>> 
>> [root@bb-nsi-ib04 pt2pt]# ompi_info --param btl all
>>                 MCA btl: parameter "btl_tcp_if_include" (current value: "",
>>                          data source: default, level: 1 user/basic, type:
>>                          string)
>>                          Comma-delimited list of devices and/or CIDR
>>                          notation of networks to use for MPI communication
>>                          (e.g., "eth0,192.168.0.0/16").  Mutually exclusive
>>                          with btl_tcp_if_exclude.
>>                 MCA btl: parameter "btl_tcp_if_exclude" (current value:
>>                          "127.0.0.1/8,sppp", data source: default, level: 1
>>                          user/basic, type: string)
>>                          Comma-delimited list of devices and/or CIDR
>>                          notation of networks to NOT use for MPI
>>                          communication -- all devices not matching these
>>                          specifications will be used (e.g.,
>>                          "eth0,192.168.0.0/16").  If set to a non-default
>>                          value, it is mutually exclusive with
>>                          btl_tcp_if_include.
>> [root@bb-nsi-ib04 pt2pt]#
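>> 
>> As noted above, ompi_info only shows a few MCA parameters by default; a sketch 
>> of how the openib BTL parameters could be listed, assuming the same 
>> --param/--level syntax:
>> 
>>    ompi_info --param btl openib --level 9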
>> 
>> I hope to gain a deeper understanding of this.
>> 
>> Thanks
>> --Yanfei
>> 
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Wang,Yanfei(SYS)
>> Sent: March 27, 2014 18:17
>> To: Open MPI Developers
>> Subject: [OMPI devel] RE: doubt on latency result with OpenMPI library
>> 
>> Hi, 
>> 
>> "--map-by node" does fix that problem.
>> ---
>> Configuration:
>> Even when using mpirun --hostfile (with the 10G IP addresses in /etc/hosts) to 
>> try to steer traffic onto the 10G TCP/IP network, the latency is still 5us in 
>> both cases!
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.71.3 ib03
>> 192.168.71.4 ib04
>> [root@bb-nsi-ib04 pt2pt]# ifconfig
>> eth0      Link encap:Ethernet  HWaddr 20:0B:C7:26:3F:C3 
>>          inet addr:192.168.71.4  Bcast:192.168.71.255  Mask:255.255.255.0
>>          inet6 addr: fe80::220b:c7ff:fe26:3fc3/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:834635 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:339853 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:681908607 (650.3 MiB)  TX bytes:103031295 (98.2 MiB)  
>> The 10G eth0 is not an RDMA-enabled NIC.
>> 
>> a.       Using openib explicitly:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node --mca 
>> btl openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                       5.20
>> 1                       5.36
>> 2                       5.31
>> 4                       5.34
>> 8                       5.46
>> 16                      5.35
>> 32                      5.44
>> 64                      5.48
>> 128                     6.74
>> 256                     6.87
>> 512                     7.05
>> 1024                    7.52
>> 2048                    8.38
>> 4096                   10.36
>> 8192                   14.18
>> 16384                  23.69
>> 32768                  31.91
>> 65536                  38.89
>> 131072                 47.76
>> 262144                 80.42
>> 524288                137.52
>> 1048576               251.81
>> 2097152               485.23
>> 4194304               948.08
>> b.       With no explicit RDMA settings:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node 
>> osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                       5.23
>> 1                       5.28
>> 2                       5.21
>> 4                       5.33
>> 8                       5.33
>> 16                      5.36
>> 32                      5.33
>> 64                      5.41
>> 128                     6.74
>> 256                     6.98
>> 512                     7.11
>> 1024                    7.47
>> 2048                    8.46
>> 4096                   10.38
>> 8192                   14.30
>> 16384                  21.20
>> 32768                  31.21
>> 65536                  39.85
>> 131072                 47.70
>> 262144                 80.24
>> 524288                137.59
>> 1048576               251.62
>> 2097152               485.14
>> 4194304               945.80
>> [root@bb-nsi-ib04 pt2pt]#
>> 
>> I found that the bandwidth reported by the osu_bw benchmark matches the 40G RDMA 
>> HCA, so I suspect that the traffic always goes over the 40G RDMA link and that 
>> my attempt to steer it onto the TCP/IP link is not taking effect.
>> 
>> I will consult the FAQ for details; any further suggestions are welcome.
>> 
>> Thanks
>> --Yanfei
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: March 27, 2014 18:05
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] doubt on latency result with OpenMPI library
>> 
>> Try adding "--map-by node" to your command line to ensure the procs really 
>> are running on separate nodes.
>> 
>> 
>> 
>> On Thu, Mar 27, 2014 at 1:40 AM, Wang,Yanfei(SYS) <wangyanfe...@baidu.com> 
>> wrote:
>> Hi, 
>> 
>> HW Test Topology:
>> IP: 192.168.72.3/24 -- 192.168.72.4/24, with VLAN and RoCE enabled
>> IB03 server 40G port --- 40G Ethernet switch --- IB04 server 40G port: 
>> configured as the RoCE link
>> IP: 192.168.71.3/24 -- 192.168.71.4/24
>> IB03 server 10G port --- 10G Ethernet switch --- IB04 server 10G port: 
>> configured as a normal TCP/IP Ethernet link (server management interface)
>> 
>> MPI configuration:
>> MPI Hosts file:
>> [root@bb-nsi-ib04 pt2pt]# cat hosts
>> ib03 slots=1
>> ib04 slots=1
>> DNS hosts
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.71.3 ib03
>> 192.168.71.4 ib04
>> [root@bb-nsi-ib04 pt2pt]#
>> This configuration will create 2 nodes for MPI latency evaluation
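>> 
>> If the goal is to pin MPI traffic to the 10G subnet, a sketch based on the 
>> btl_tcp_if_include parameter documented elsewhere in this thread, using the 
>> 192.168.71.0/24 management network in CIDR form (an assumption, not a command 
>> that was actually run here):
>> 
>>    mpirun --hostfile hosts -np 2 --map-by node --mca btl tcp,sm,self --mca btl_tcp_if_include 192.168.71.0/24 osu_latency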
>> 
>> Benchmark:
>> osu-micro-benchmarks-4.3
>> 
>> Results:
>> a.       Traffic directed over the 10G TCP/IP port using the following 
>> /etc/hosts file
>> 
>> 
>> root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.71.3 ib03
>> 192.168.71.4 ib04
>> The average latency from osu_latency is about 4.5us; see the log below:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                       4.56
>> 1                       4.90
>> 2                       4.90
>> 4                       4.60
>> 8                       4.71
>> 16                      4.72
>> 32                      5.40
>> 64                      4.77
>> 128                     6.74
>> 256                     7.01
>> 512                     7.14
>> 1024                    7.63
>> 2048                    8.22
>> 4096                   10.39
>> 8192                   14.26
>> 16384                  20.80
>> 32768                  31.97
>> 65536                  37.75
>> 131072                 47.28
>> 262144                 80.40
>> 524288                137.65
>> 1048576               250.17
>> 2097152               484.71
>> 4194304               946.01
>> 
>> b.       Traffic directed over the RoCE link using the /etc/hosts below and 
>> mpirun --mca btl openib,self,sm ...
>> 
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.72.3 ib03
>> 192.168.72.4 ib04
>> Result:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --mca btl 
>> openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                       4.83
>> 1                       5.17
>> 2                       5.12
>> 4                       5.25
>> 8                       5.38
>> 16                      5.40
>> 32                      5.19
>> 64                      5.04
>> 128                     6.74
>> 256                     7.04
>> 512                     7.34
>> 1024                    7.91
>> 2048                    8.17
>> 4096                   10.39
>> 8192                   14.22
>> 16384                  22.05
>> 32768                  31.68
>> 65536                  37.57
>> 131072                 48.25
>> 262144                 79.98
>> 524288                137.66
>> 1048576               251.38
>> 2097152               485.66
>> 4194304               947.81
>> [root@bb-nsi-ib04 pt2pt]#
>> 
>> Questions:  
>> 1.       Why do both cases show a similar latency of about 5us, which seems too 
>> low to believe? In our test environment it takes more than 50us to handle a TCP 
>> SYN and return the SYN-ACK, and an x86 server takes more than 20us on average to 
>> do IP forwarding (measured with a professional hardware tester), so is this 
>> latency reasonable?
>> 
>> 2.       Normally the switch introduces more than 1.5us of switching time. Using 
>> Accelio, an open-source RDMA library released by Mellanox, a simple ping-pong 
>> test takes at least 4us of round-trip latency. So the 5us MPI latency above (for 
>> both TCP/IP and RoCE) is rather hard to believe.
>> 
>> 3.       The fact that the TCP/IP transport and the RoCE RDMA transport show the 
>> same latency is puzzling.
>> 
>> 
>> 
>> Before I dig deeply into what happens inside the MPI benchmark, could you give 
>> us some suggestions? Is the mpirun command being used correctly here?
>> There must be some mistake in this test; please correct me.
>> 
>> E.g., TCP SYN / SYN-ACK latency capture:
>> <image001.png>
>> 
>> Thanks
>> -Yanfei
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/03/14400.php
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/03/14403.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/03/14404.php
