It seems that all of my confusion has now been cleared up. Thanks, Jeff!

Thanks,
Yanfei
Sent from my iPad

> On Mar 27, 2014, at 8:38 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>
> Here are a few key facts that might help:
>
> 1. The hostfile has nothing to do with which network interfaces are used for MPI traffic. It is only used to specify which servers you launch on, regardless of which IP interface on those servers you name.
> 2. The network interfaces that are used are determined by the BTLs selected plus any optional additional parameters given to those BTLs.
> 3. If you do not specify any BTL, then Open MPI will choose the "best" ones and use those.
> 4. As of somewhere in the v1.7.x series, the ompi_info command only shows a few MCA parameters by default. To see all MCA parameters, add "--level 9" to the command line.
>
> In your case, if you didn't specify a BTL, Open MPI would see your RoCE interfaces and therefore choose the openib BTL for off-node communication (and exclude the TCP BTL, because it is "worse" than the openib BTL), sm for on-node communication, and self for loopback communication.
>
> If you specify --mca btl tcp,sm,self, then you are restricting the pool of BTLs that OMPI can choose from -- meaning that the openib BTL won't even be considered. So OMPI will therefore use the TCP BTL for off-node communication.
>
> Also, remember that you can "mpirun ... hostname" (i.e., the Linux "hostname" command) to verify which servers you are actually running on.
>
> I see that the ompi_info(1) man page is not super-detailed about the --level option; I'll go fix that right now (and ensure it's in the v1.8 release).
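>
> For example (just a sketch; the exact set of parameters varies by Open MPI version), the following should list every BTL parameter, including the openib ones that the default level-1 output hides:
>
>     ompi_info --param btl all --level 9
>
> The per-BTL interface-selection parameters (btl_tcp_if_include, and the corresponding openib parameters if your build has them) show up in that fuller listing.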
>
>> On Mar 27, 2014, at 6:44 AM, "Wang,Yanfei(SYS)" <wangyanfe...@baidu.com> wrote:
>>
>> Hi,
>>
>> Update:
>> If I explicitly assign --mca btl tcp,sm,self, the traffic goes over the 10G TCP/IP link instead of the 40G RDMA link, and the TCP/IP latency averages 22us, which is reasonable.
>>
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node --mca btl tcp,sm,self osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size       Latency (us)
>> 0                   22.07
>> 1                   22.48
>> 2                   22.38
>> 4                   22.39
>> 8                   22.52
>> 16                  22.52
>> 32                  22.59
>> 64                  22.73
>> 128                 23.01
>> 256                 24.32
>> 512                 28.50
>> 1024                31.06
>> 2048                56.06
>> 4096                68.53
>> 8192                77.09
>> 16384              105.23
>> 32768              143.51
>> 65536              229.79
>> 131072             285.28
>> 262144             423.26
>> 524288             693.82
>> 1048576           1634.03
>> 2097152           3311.69
>> 4194304           7055.16
>>
>> The conclusion is that a hostfile with the 10G IP addresses does not by itself make the traffic select the 10G TCP/IP link: mpirun selects the RDMA link by default even if you did not enable "--mca btl openib,sm,self"!
>> So, how should I understand that "--hostfile" does not control this, and how can I control which of the multiple HCAs/NICs carries the MPI traffic?
>>
>> Besides, the following command does not show any parameters for controlling the RDMA transport, only the TCP ones:
>>
>> [root@bb-nsi-ib04 pt2pt]# ompi_info --param btl all
>>     MCA btl: parameter "btl_tcp_if_include" (current value: "", data source: default, level: 1 user/basic, type: string)
>>              Comma-delimited list of devices and/or CIDR notation of networks to use for MPI communication (e.g., "eth0,192.168.0.0/16"). Mutually exclusive with btl_tcp_if_exclude.
>>     MCA btl: parameter "btl_tcp_if_exclude" (current value: "127.0.0.1/8,sppp", data source: default, level: 1 user/basic, type: string)
>>              Comma-delimited list of devices and/or CIDR notation of networks to NOT use for MPI communication -- all devices not matching these specifications will be used (e.g., "eth0,192.168.0.0/16"). If set to a non-default value, it is mutually exclusive with btl_tcp_if_include.
>> [root@bb-nsi-ib04 pt2pt]#
>>
>> I hope to get a deeper understanding of this.
>>
>> Thanks
>> --Yanfei
>>
>> From: devel [mailto:devel-boun...@open-mpi.org] on behalf of Wang,Yanfei(SYS)
>> Sent: Mar 27, 2014 18:17
>> To: Open MPI Developers
>> Subject: [OMPI devel] Re: doubt on latency result with OpenMPI library
>>
>> Hi,
>>
>> "--map-by node" does remove this problem.
>> ---
>> Configuration:
>> Even when I use the mpirun hostfile to steer the traffic onto the 10G TCP/IP network, the latency is still 5us in both cases!
>>
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.71.3 ib03
>> 192.168.71.4 ib04
>> [root@bb-nsi-ib04 pt2pt]# ifconfig
>> eth0      Link encap:Ethernet  HWaddr 20:0B:C7:26:3F:C3
>>           inet addr:192.168.71.4  Bcast:192.168.71.255  Mask:255.255.255.0
>>           inet6 addr: fe80::220b:c7ff:fe26:3fc3/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:834635 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:339853 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:681908607 (650.3 MiB)  TX bytes:103031295 (98.2 MiB)
>>
>> The 10G eth0 is not an RDMA-enabled NIC.
>>
>> a. Using openib:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size       Latency (us)
>> 0                    5.20
>> 1                    5.36
>> 2                    5.31
>> 4                    5.34
>> 8                    5.46
>> 16                   5.35
>> 32                   5.44
>> 64                   5.48
>> 128                  6.74
>> 256                  6.87
>> 512                  7.05
>> 1024                 7.52
>> 2048                 8.38
>> 4096                10.36
>> 8192                14.18
>> 16384               23.69
>> 32768               31.91
>> 65536               38.89
>> 131072              47.76
>> 262144              80.42
>> 524288             137.52
>> 1048576            251.81
>> 2097152            485.23
>> 4194304            948.08
>>
>> b. With no explicit RDMA setting:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size       Latency (us)
>> 0                    5.23
>> 1                    5.28
>> 2                    5.21
>> 4                    5.33
>> 8                    5.33
>> 16                   5.36
>> 32                   5.33
>> 64                   5.41
>> 128                  6.74
>> 256                  6.98
>> 512                  7.11
>> 1024                 7.47
>> 2048                 8.46
>> 4096                10.38
>> 8192                14.30
>> 16384               21.20
>> 32768               31.21
>> 65536               39.85
>> 131072              47.70
>> 262144              80.24
>> 524288             137.59
>> 1048576            251.62
>> 2097152            485.14
>> 4194304            945.80
>> [root@bb-nsi-ib04 pt2pt]#
>>
>> I found that the bandwidth reported by the osu_bw benchmark matches the 40G RDMA HCA, so I suspect that the traffic always goes over the 40G RDMA link and that the attempt to steer it onto the TCP/IP link does not work.
>>
>> I will consult the FAQ for details; any further suggestions are welcome.
>>
>> Thanks
>> --Yanfei
>>
>> From: devel [mailto:devel-boun...@open-mpi.org] on behalf of Ralph Castain
>> Sent: Mar 27, 2014 18:05
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] doubt on latency result with OpenMPI library
>>
>> Try adding "--map-by node" to your command line to ensure the procs really are running on separate nodes.
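>>
>> As a quick sanity check (just a sketch), you can also launch the plain Linux "hostname" command with the same options and confirm that two different node names come back:
>>
>>     mpirun --hostfile hosts -np 2 --map-by node hostname
>>
>> If both lines of output name the same server, the two ranks landed on one node and the measured latency is shared-memory latency rather than network latency.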
>>
>> On Thu, Mar 27, 2014 at 1:40 AM, Wang,Yanfei(SYS) <wangyanfe...@baidu.com> wrote:
>>
>> Hi,
>>
>> HW test topology:
>> IP: 192.168.72.4/24 -- 192.168.72.4/24, with VLAN and RoCE enabled
>> IB03 server 40G port --- 40G Ethernet switch --- IB04 server 40G port: configured as the RoCE link
>> IP: 192.168.71.3/24 -- 192.168.71.4/24
>> IB03 server 10G port --- 10G Ethernet switch --- IB04 server 10G port: configured as a normal TCP/IP Ethernet link (server management interface)
>>
>> MPI configuration:
>> MPI hosts file:
>> [root@bb-nsi-ib04 pt2pt]# cat hosts
>> ib03 slots=1
>> ib04 slots=1
>> DNS hosts:
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.71.3 ib03
>> 192.168.71.4 ib04
>> [root@bb-nsi-ib04 pt2pt]#
>> This configuration creates 2 nodes for the MPI latency evaluation.
>>
>> Benchmark:
>> osu-micro-benchmarks-4.3
>>
>> Results:
>> a. Direct the traffic over the 10G TCP/IP port using the following /etc/hosts file:
>>
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.71.3 ib03
>> 192.168.71.4 ib04
>>
>> The average latency from osu_latency is 4.5us; see the log below:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size       Latency (us)
>> 0                    4.56
>> 1                    4.90
>> 2                    4.90
>> 4                    4.60
>> 8                    4.71
>> 16                   4.72
>> 32                   5.40
>> 64                   4.77
>> 128                  6.74
>> 256                  7.01
>> 512                  7.14
>> 1024                 7.63
>> 2048                 8.22
>> 4096                10.39
>> 8192                14.26
>> 16384               20.80
>> 32768               31.97
>> 65536               37.75
>> 131072              47.28
>> 262144              80.40
>> 524288             137.65
>> 1048576            250.17
>> 2097152            484.71
>> 4194304            946.01
>>
>> b. Direct the traffic over the RoCE link using the following /etc/hosts and "mpirun --mca btl openib,self,sm ...":
>>
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.72.3 ib03
>> 192.168.72.4 ib04
>>
>> Result:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size       Latency (us)
>> 0                    4.83
>> 1                    5.17
>> 2                    5.12
>> 4                    5.25
>> 8                    5.38
>> 16                   5.40
>> 32                   5.19
>> 64                   5.04
>> 128                  6.74
>> 256                  7.04
>> 512                  7.34
>> 1024                 7.91
>> 2048                 8.17
>> 4096                10.39
>> 8192                14.22
>> 16384               22.05
>> 32768               31.68
>> 65536               37.57
>> 131072              48.25
>> 262144              79.98
>> 524288             137.66
>> 1048576            251.38
>> 2097152            485.66
>> 4194304            947.81
>> [root@bb-nsi-ib04 pt2pt]#
>>
>> Questions:
>> 1. Why do both cases show a similar latency of about 5us, which seems too low to believe? In our test environment it takes more than 50us to handle a TCP SYN and return the SYN-ACK, and an x86 server takes more than 20us on average to do IP forwarding (measured with a professional hardware tester), so is this latency reasonable?
>> 2. Normally the switch introduces more than 1.5us of switching time. Using Accelio, an open-source RDMA library released by Mellanox, a simple ping-pong test takes at least 4us of round-trip latency. So the 5us MPI latency above (for both TCP/IP and RoCE) is hard to believe...
>> 3. The fact that the TCP/IP transport and the RoCE RDMA transport show the same latency is puzzling.
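>>
>> One experiment that should isolate the path (a sketch; it assumes the TCP BTL's btl_tcp_if_include parameter behaves as documented) is to restrict Open MPI to the TCP BTL and pin it to the 10G subnet:
>>
>>     mpirun --hostfile hosts -np 2 --mca btl tcp,self,sm --mca btl_tcp_if_include 192.168.71.0/24 osu_latency
>>
>> If the small-message latency then rises to typical TCP/IP values, the 5us numbers above were not measured over the 10G TCP/IP link.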
>>
>> Before I understand in depth what happens inside the MPI benchmark, can you offer some suggestions? Does the mpirun command work correctly here? There must be some mistake in this test; please correct me.
>>
>> E.g., TCP SYN / SYN-ACK latency:
>> <image001.png>
>>
>> Thanks
>> -Yanfei
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/