Hi,

Adding "--map-by node" does not remove this trouble.
---
Configuration:
Even when using the mpirun --hostfile to direct traffic onto the 10G TCP/IP network, the latency is still about 5 us in both cases!
[root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.71.3 ib03
192.168.71.4 ib04
[root@bb-nsi-ib04 pt2pt]# ifconfig
eth0      Link encap:Ethernet  HWaddr 20:0B:C7:26:3F:C3
          inet addr:192.168.71.4  Bcast:192.168.71.255  Mask:255.255.255.0
          inet6 addr: fe80::220b:c7ff:fe26:3fc3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:834635 errors:0 dropped:0 overruns:0 frame:0
          TX packets:339853 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:681908607 (650.3 MiB)  TX bytes:103031295 (98.2 MiB)
The 10G eth0 interface is not an RDMA-enabled NIC.
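
As a sanity check on the mapping (just a sketch using standard mpirun options, not output from these runs), the process map can be printed before launch:

mpirun --hostfile hosts -np 2 --map-by node --display-map osu_latency

"--display-map" (or "--report-bindings") makes mpirun show which node each rank is placed on, which should confirm that the two ranks really land on ib03 and ib04.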


a. Using the openib BTL explicitly:
[root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node --mca btl 
openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
# OSU MPI Latency Test v4.3
# Size          Latency (us)
0                       5.20
1                       5.36
2                       5.31
4                       5.34
8                       5.46
16                      5.35
32                      5.44
64                      5.48
128                     6.74
256                     6.87
512                     7.05
1024                    7.52
2048                    8.38
4096                   10.36
8192                   14.18
16384                  23.69
32768                  31.91
65536                  38.89
131072                 47.76
262144                 80.42
524288                137.52
1048576               251.81
2097152               485.23
4194304               948.08

b. No explicit RDMA setting:
[root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node 
osu_latency
# OSU MPI Latency Test v4.3
# Size          Latency (us)
0                       5.23
1                       5.28
2                       5.21
4                       5.33
8                       5.33
16                      5.36
32                      5.33
64                      5.41
128                     6.74
256                     6.98
512                     7.11
1024                    7.47
2048                    8.46
4096                   10.38
8192                   14.30
16384                  21.20
32768                  31.21
65536                  39.85
131072                 47.70
262144                 80.24
524288                137.59
1048576               251.62
2097152               485.14
4194304               945.80
[root@bb-nsi-ib04 pt2pt]#
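
To see which transport these runs actually selected, the BTL verbosity can be raised (a sketch using Open MPI's standard MCA verbosity parameter, not something I ran above):

mpirun --hostfile hosts -np 2 --map-by node --mca btl_base_verbose 30 osu_latency

The extra diagnostic output should indicate whether the openib or the tcp BTL is used for the inter-node connection.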

I found that the bandwidth reported by the osu_bw benchmark matches the 40G RDMA HCA, so I suspect that the traffic always goes over the 40G RDMA link and that steering it onto the TCP/IP link via /etc/hosts does not actually take effect.
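
If the TCP/IP path is really wanted, the transport can be pinned explicitly instead of relying on /etc/hosts, since Open MPI normally selects the fastest available BTL (openib here) regardless of which addresses the hostnames resolve to. A sketch, assuming the standard BTL selection parameters:

mpirun --hostfile hosts -np 2 --map-by node --mca btl tcp,self,sm --mca btl_tcp_if_include eth0 osu_latency

or, alternatively, excluding the RDMA-capable BTL:

mpirun --hostfile hosts -np 2 --map-by node --mca btl ^openib osu_latency

If the latency then rises well above 5 us, the earlier runs were indeed using the 40G RDMA link despite the /etc/hosts entries.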

I will consult the FAQ for details; any further suggestions are welcome.

Thanks
--Yanfei
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: March 27, 2014 18:05
To: Open MPI Developers
Subject: Re: [OMPI devel] doubt on latency result with OpenMPI library

Try adding "--map-by node" to your command line to ensure the procs really are 
running on separate nodes.

On Thu, Mar 27, 2014 at 1:40 AM, Wang,Yanfei(SYS)
<wangyanfe...@baidu.com> wrote:
Hi,

HW Test Topology:
IP: 192.168.72.3/24 and 192.168.72.4/24, with VLAN and RoCE enabled
IB03 server 40G port --- 40G Ethernet switch --- IB04 server 40G port: configured as the RoCE link
IP: 192.168.71.3/24 and 192.168.71.4/24
IB03 server 10G port --- 10G Ethernet switch --- IB04 server 10G port: configured as a normal TCP/IP Ethernet link (server management interface)
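
(To double-check which ports are actually RDMA/RoCE capable, and assuming the standard libibverbs utilities are installed, one can list the RDMA devices; this is a sketch, not output from these machines:

ibv_devinfo

Each device and port is listed with a link_layer field, which should read Ethernet for RoCE ports and InfiniBand for IB ports.)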

MPI configuration:
MPI Hosts file:
[root@bb-nsi-ib04 pt2pt]# cat hosts
ib03 slots=1
ib04 slots=1
DNS hosts
[root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.71.3 ib03
192.168.71.4 ib04
[root@bb-nsi-ib04 pt2pt]#
This configuration provides two nodes for the MPI latency evaluation.

Benchmark:
osu-micro-benchmarks-4.3

Results:

a. Traffic directed over the 10G TCP/IP ports, using the following /etc/hosts file:

[root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.71.3 ib03
192.168.71.4 ib04
The average latency reported by osu_latency is about 4.5 us; see the log below:
[root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 osu_latency
# OSU MPI Latency Test v4.3
# Size          Latency (us)
0                       4.56
1                       4.90
2                       4.90
4                       4.60
8                       4.71
16                      4.72
32                      5.40
64                      4.77
128                     6.74
256                     7.01
512                     7.14
1024                    7.63
2048                    8.22
4096                   10.39
8192                   14.26
16384                  20.80
32768                  31.97
65536                  37.75
131072                 47.28
262144                 80.40
524288                137.65
1048576               250.17
2097152               484.71
4194304               946.01


b. Traffic directed over the RoCE link, using the /etc/hosts below and mpirun --mca btl openib,self,sm …
[root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.72.3 ib03
192.168.72.4 ib04
Result:
[root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --mca btl 
openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
# OSU MPI Latency Test v4.3
# Size          Latency (us)
0                       4.83
1                       5.17
2                       5.12
4                       5.25
8                       5.38
16                      5.40
32                      5.19
64                      5.04
128                     6.74
256                     7.04
512                     7.34
1024                    7.91
2048                    8.17
4096                   10.39
8192                   14.22
16384                  22.05
32768                  31.68
65536                  37.57
131072                 48.25
262144                 79.98
524288                137.66
1048576               251.38
2097152               485.66
4194304               947.81
[root@bb-nsi-ib04 pt2pt]#

Questions:

1.       Why do both cases show a similar latency of about 5 us? That seems too small to believe. In our test environment it takes more than 50 us to handle a TCP SYN and return the SYN-ACK, and an x86 server takes more than 20 us on average to do IP forwarding (measured with a professional hardware tester). Is this latency reasonable?

2.       Normally the switch alone introduces more than 1.5 us of switching delay. With Accelio, an open-source RDMA library released by Mellanox, a simple ping-pong test takes at least 4 us of round-trip latency. So the 5 us MPI latency above (for both TCP/IP and RoCE) is rather hard to believe (see the rough note after these questions).

3.       The fact that the TCP/IP transport and the RoCE RDMA transport show the same latency is puzzling.
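
(A rough note on what the reported number means, assuming osu_latency follows the usual ping-pong methodology with warmup iterations excluded from the timing:
    reported one-way latency = total ping-pong time / (2 x iterations)
    5 us one-way  ->  roughly 10 us round trip
Any one-time connection setup, such as the TCP SYN/SYN-ACK exchange, would typically happen before the timed loop and so would not appear in these per-message figures.)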


Before digging deeply into what happens inside the MPI benchmark, could you give us some suggestions? Does the mpirun command work correctly here? There must be some mistake in this test; please correct me.

E.g., TCP SYN / SYN-ACK latency:
[inline image: packet capture showing the TCP SYN/SYN-ACK timing]

Thanks
-Yanfei

