Hi,

HW test topology:

IB03 server 40G port --- 40G Ethernet switch --- IB04 server 40G port: configured as the RoCE link, with VLAN and RoCE enabled; IP 192.168.72.3/24 --- 192.168.72.4/24.

IB03 server 10G port --- 10G Ethernet switch --- IB04 server 10G port: configured as a normal TCP/IP Ethernet link (server management interface); IP 192.168.71.3/24 --- 192.168.71.4/24.
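For reference, the interface-to-subnet mapping can be double-checked with something like the following sketch (ibdev2netdev is an MLNX_OFED helper script and may not be present on every install; the device/interface names are whatever your system reports):

  # list RDMA devices and port state (libibverbs-utils)
  ibv_devinfo | grep -E 'hca_id|state'

  # map RDMA devices to their Ethernet netdevs (MLNX_OFED helper, if installed)
  ibdev2netdev

  # confirm which netdev holds the RoCE subnet vs. the 10G management subnet
  ip addr show | grep -E '192\.168\.7[12]\.'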
MPI configuration:

MPI hostfile:

[root@bb-nsi-ib04 pt2pt]# cat hosts
ib03 slots=1
ib04 slots=1

DNS hosts:

[root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.71.3 ib03
192.168.71.4 ib04
[root@bb-nsi-ib04 pt2pt]#

This configuration creates two nodes for the MPI latency evaluation.

Benchmark: osu-micro-benchmarks-4.3

Results:

a. Traffic routed over the 10G TCP/IP port, using the following /etc/hosts file:

[root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.71.3 ib03
192.168.71.4 ib04

The average latency of osu_latency is about 4.5 us; see the following log:

[root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 osu_latency
# OSU MPI Latency Test v4.3
# Size       Latency (us)
0            4.56
1            4.90
2            4.90
4            4.60
8            4.71
16           4.72
32           5.40
64           4.77
128          6.74
256          7.01
512          7.14
1024         7.63
2048         8.22
4096         10.39
8192         14.26
16384        20.80
32768        31.97
65536        37.75
131072       47.28
262144       80.40
524288       137.65
1048576      250.17
2097152      484.71
4194304      946.01

b. Traffic routed over the RoCE link, using the /etc/hosts below and "mpirun --mca btl openib,self,sm …":

[root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
192.168.72.3 ib03
192.168.72.4 ib04

Result:

[root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
# OSU MPI Latency Test v4.3
# Size       Latency (us)
0            4.83
1            5.17
2            5.12
4            5.25
8            5.38
16           5.40
32           5.19
64           5.04
128          6.74
256          7.04
512          7.34
1024         7.91
2048         8.17
4096         10.39
8192         14.22
16384        22.05
32768        31.68
65536        37.57
131072       48.25
262144       79.98
524288       137.66
1048576      251.38
2097152      485.66
4194304      947.81
[root@bb-nsi-ib04 pt2pt]#

Questions:

1. Why do the two runs show similar latency, around 5 us? That seems too low to believe. In our test environment it takes more than 50 us to handle a TCP SYN and return the SYN-ACK, and an x86 server takes more than 20 us on average to do IP forwarding (measured by a professional HW tester), so is this latency reasonable?

2. Normally the switch alone introduces more than 1.5 us of switching time. Using Accelio, the open-source RDMA library released by Mellanox, a simple ping-pong test takes at least 4 us round-trip. So the ~5 us MPI latency above (for both TCP/IP and RoCE) is rather hard to believe…

3. The fact that the TCP/IP transport and the RoCE RDMA transport show the same latency is puzzling. Before digging deeply into what happens inside the MPI benchmark, could you give us some suggestions? Does the mpirun command work correctly here? There must be some mistake in this test; please correct me. (See also the P.S. below for the variants I plan to try.)

E.g., TCP SYN / SYN-ACK latency:

[inline image: TCP SYN / SYN-ACK latency]

Thanks
-Yanfei
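P.S. To make the transport selection unambiguous when rerunning this, the variants below are what I plan to try. This is only a sketch: the MCA parameter names are the standard Open MPI ones as I understand them, and eth0 / mlx4_0:1 are placeholders for the actual 10G interface and RoCE device:port on these servers.

  # force the TCP BTL and restrict it to the 10G management interface
  mpirun --hostfile hosts -np 2 \
      --mca btl tcp,self,sm \
      --mca btl_tcp_if_include eth0 \
      osu_latency

  # force the openib BTL over the RoCE port (RoCE typically requires the RDMA CM connection manager)
  mpirun --hostfile hosts -np 2 \
      --mca btl openib,self,sm \
      --mca btl_openib_cpc_include rdmacm \
      --mca btl_openib_if_include mlx4_0:1 \
      osu_latency

  # raise BTL verbosity to log which transport is actually selected
  mpirun --hostfile hosts -np 2 --mca btl_base_verbose 30 --mca btl tcp,self,sm osu_latency

As far as I understand, without an explicit --mca btl setting Open MPI chooses among the available BTLs on its own, so run (a) may not have been forced onto TCP at all; forcing it as above should make the comparison unambiguous.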