Hi,
Thanks Ralph and Jeff.

Do we have full documentation on these parameters, and more broadly on the 
Open MPI transport design architecture?
Please recommend a website or paper.

Thanks
Yanfei

Sent from my iPad

On Mar 27, 2014, at 10:10 PM, "Ralph Castain" 
<r...@open-mpi.org> wrote:

Just one other point to clarify - there is an apparent misunderstanding 
regarding the following MCA param:

-mca btl_openib_cpc_include rdmacm

This param has nothing to do with telling openib to use RDMA for communication. 
What it does is tell the openib BTL to use RDMA to establish the point-to-point 
connection between the two processes. The actual messaging may or may not use 
RDMA to move the bytes - that's a totally separate code path.
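
As a rough illustration (the exact parameter names and values are worth 
double-checking with ompi_info on your installed version):

  # use RDMA-CM only to establish the openib connections; this does not
  # change how the message data itself is moved
  mpirun --hostfile hosts -np 2 --map-by node --mca btl openib,sm,self \
         --mca btl_openib_cpc_include rdmacm osu_latency

  # the data path is controlled by separate openib parameters, e.g.
  # btl_openib_flags (a bitmask selecting send / RDMA put / RDMA get)
  ompi_info --param btl openib --level 9 | grep btl_openib_flags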



On Thu, Mar 27, 2014 at 6:21 AM, Wang,Yanfei(SYS) 
<wangyanfe...@baidu.com> wrote:
It seems all the confusion has been cleared up -- thanks, Jeff!

Thanks!
Yanfei

Sent from my iPad

> On Mar 27, 2014, at 8:38 PM, "Jeff Squyres (jsquyres)" 
> <jsquy...@cisco.com> wrote:
>
> Here are a few key facts that might help:
>
> 1. The hostfile has nothing to do with which network interfaces are used for 
> MPI traffic.  It only specifies which servers you launch on, regardless of 
> which IP interface of those servers you name in it.
> 2. Which network interfaces are used is determined by a combination of the 
> BTL(s) selected and any optional additional parameters given to that BTL.
> 3. If you do not specify any BTLs, then Open MPI will choose the "best" ones 
> and use those.
> 4. As of somewhere in the v1.7.x series, the ompi_info command only shows a 
> few MCA parameters by default.  To see all MCA parameters, add "--level 9" to 
> the command line.
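> For example, to see all of the openib BTL parameters (not just the handful 
> shown at the default level):
>
>    ompi_info --param btl openib --level 9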
>
> In your case, if you didn't specify a BTL, Open MPI would see your RoCE 
> interfaces and therefore choose the openib BTL for off-node communication 
> (and exclude the TCP BTL, because it is "worse" than the openib BTL), sm for 
> on-node communication, and self for loopback communication.
>
> If you specify --mca btl tcp,sm,self, then you are restricting OMPI's pool of 
> BTLs that it can choose from -- meaning that the openib BTL won't even be 
> considered.  So OMPI will therefore use the TCP BTL for off-node 
> communication.
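>
> To see which BTLs actually end up being used at run time, you can also bump 
> up the btl framework verbosity (the parameter should be btl_base_verbose -- 
> check ompi_info if in doubt), e.g.:
>
>    mpirun --hostfile hosts -np 2 --map-by node --mca btl_base_verbose 100 osu_latency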
>
> Also, remember that you can "mpirun ... hostname" (i.e., the Linux "hostname" 
> command) to verify what servers you are actually running on.
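> For example:
>
>    mpirun --hostfile hosts -np 2 --map-by node hostname
>
> should print the names of the two servers (ib03 and ib04 in your case).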
>
> I see that the ompi_info(1) man page is not super-detailed about the --level 
> option; I'll go fix that right now (and ensure it's in the v1.8 release).
>
>
>
>> On Mar 27, 2014, at 6:44 AM, "Wang,Yanfei(SYS)" 
>> <wangyanfe...@baidu.com> wrote:
>>
>> Hi,
>>
>> Update:
>> If I explicitly assign --mca btl tcp,sm,self, the traffic goes over the 10G 
>> TCP/IP link instead of the 40G RDMA link, and the TCP/IP latency is about 
>> 22 us on average, which is reasonable.
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node --mca 
>> btl tcp,sm,self osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                      22.07
>> 1                      22.48
>> 2                      22.38
>> 4                      22.39
>> 8                      22.52
>> 16                     22.52
>> 32                     22.59
>> 64                     22.73
>> 128                    23.01
>> 256                    24.32
>> 512                    28.50
>> 1024                   31.06
>> 2048                   56.06
>> 4096                   68.53
>> 8192                   77.09
>> 16384                 105.23
>> 32768                 143.51
>> 65536                 229.79
>> 131072                285.28
>> 262144                423.26
>> 524288                693.82
>> 1048576              1634.03
>> 2097152              3311.69
>> 4194304              7055.16
>>
>> The conclusion is that with "--mca btl tcp,sm,self" the traffic does go over 
>> the 10G TCP/IP link named in the hostfile, but mpirun selects the RDMA link 
>> by default when "--mca btl openib,sm,self" is not given.
>> So how should I understand why "--hostfile" alone does not steer the traffic, 
>> and how can multi-HCA (NIC) traffic be controlled with the MPI library?
>>
>> Besides, the following command does not show any information about RDMA 
>> transport parameters -- only the TCP parameters.
>>
>> [root@bb-nsi-ib04 pt2pt]# ompi_info --param btl all
>>                 MCA btl: parameter "btl_tcp_if_include" (current value: "",
>>                          data source: default, level: 1 user/basic, type:
>>                          string)
>>                          Comma-delimited list of devices and/or CIDR
>>                          notation of networks to use for MPI communication
>>                          (e.g., "eth0,192.168.0.0/16").  Mutually exclusive
>>                          with btl_tcp_if_exclude.
>>                 MCA btl: parameter "btl_tcp_if_exclude" (current value:
>>                          "127.0.0.1/8,sppp", data source: default, level: 1
>>                          user/basic, type: string)
>>                          Comma-delimited list of devices and/or CIDR
>>                          notation of networks to NOT use for MPI
>>                          communication -- all devices not matching these
>>                          specifications will be used (e.g.,
>>                          "eth0,192.168.0.0/16").  If set to a non-default
>>                          value, it is mutually exclusive with
>>                          btl_tcp_if_include.
>> [root@bb-nsi-ib04 pt2pt]#
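>>
>> (From the description above, it looks as if the TCP traffic could at least 
>> be pinned to the 10G interface with something like
>>
>>    mpirun --hostfile hosts -np 2 --map-by node --mca btl tcp,sm,self \
>>           --mca btl_tcp_if_include eth0 osu_latency
>>
>> using our 10G eth0, or 192.168.71.0/24 in CIDR form.)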
>>
>> I hope to gain a deeper understanding of this.
>>
>> Thanks
>> --Yanfei
>>
>> From: devel 
>> [mailto:devel-boun...@open-mpi.org] On Behalf Of 
>> Wang,Yanfei(SYS)
>> Sent: March 27, 2014 18:17
>> To: Open MPI Developers
>> Subject: [OMPI devel] Re: doubt on latency result with OpenMPI library
>>
>> Hi,
>>
>> "--map-by node" does remove this problem.
>> ---
>> Configuration:
>> Even when using the mpirun --hostfile to steer traffic onto the 10G TCP/IP 
>> network, the latency is still about 5 us in both cases!
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.71.3 ib03
>> 192.168.71.4 ib04
>> [root@bb-nsi-ib04 pt2pt]# ifconfig
>> eth0      Link encap:Ethernet  HWaddr 20:0B:C7:26:3F:C3
>>          inet addr:192.168.71.4  Bcast:192.168.71.255  Mask:255.255.255.0
>>          inet6 addr: fe80::220b:c7ff:fe26:3fc3/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:834635 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:339853 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:681908607 (650.3 MiB)  TX bytes:103031295 (98.2 MiB)
>> The 10G eth0 is not an RDMA-enabled NIC.
>>
>> a. Using the openib BTL explicitly
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node --mca 
>> btl openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                       5.20
>> 1                       5.36
>> 2                       5.31
>> 4                       5.34
>> 8                       5.46
>> 16                      5.35
>> 32                      5.44
>> 64                      5.48
>> 128                     6.74
>> 256                     6.87
>> 512                     7.05
>> 1024                    7.52
>> 2048                    8.38
>> 4096                   10.36
>> 8192                   14.18
>> 16384                  23.69
>> 32768                  31.91
>> 65536                  38.89
>> 131072                 47.76
>> 262144                 80.42
>> 524288                137.52
>> 1048576               251.81
>> 2097152               485.23
>> 4194304               948.08
>> b. With no explicit RDMA setting
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --map-by node 
>> osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                       5.23
>> 1                       5.28
>> 2                       5.21
>> 4                       5.33
>> 8                       5.33
>> 16                      5.36
>> 32                      5.33
>> 64                      5.41
>> 128                     6.74
>> 256                     6.98
>> 512                     7.11
>> 1024                    7.47
>> 2048                    8.46
>> 4096                   10.38
>> 8192                   14.30
>> 16384                  21.20
>> 32768                  31.21
>> 65536                  39.85
>> 131072                 47.70
>> 262144                 80.24
>> 524288                137.59
>> 1048576               251.62
>> 2097152               485.14
>> 4194304               945.80
>> [root@bb-nsi-ib04 pt2pt]#
>>
>> I found that the bandwidth from the osu_bw benchmark matches the 40G RDMA 
>> HCA, so I suspect the traffic always goes over the 40G RDMA link and that 
>> the attempt to steer it onto the TCP/IP link is not taking effect.
>>
>> I will consult the FAQ for details; any further suggestions are welcome.
>>
>> Thanks
>> --Yanfei
>> From: devel 
>> [mailto:devel-boun...@open-mpi.org] On Behalf Of 
>> Ralph Castain
>> Sent: March 27, 2014 18:05
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] doubt on latency result with OpenMPI library
>>
>> Try adding "--map-by node" to your command line to ensure the procs really 
>> are running on separate nodes.
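>> For example:
>>
>>    mpirun --hostfile hosts -np 2 --map-by node osu_latency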
>>
>>
>>
>> On Thu, Mar 27, 2014 at 1:40 AM, Wang,Yanfei(SYS) 
>> <wangyanfe...@baidu.com> wrote:
>> Hi,
>>
>> HW test topology:
>> IP: 192.168.72.3/24 -- 192.168.72.4/24, with VLAN and RoCE enabled
>> IB03 server 40G port --- 40G Ethernet switch --- IB04 server 40G port: 
>> configured as the RoCE link
>> IP: 192.168.71.3/24 --- 192.168.71.4/24
>> IB03 server 10G port --- 10G Ethernet switch --- IB04 server 10G port: 
>> configured as a normal TCP/IP Ethernet link (server management interface)
>>
>> MPI configuration:
>> MPI Hosts file:
>> [root@bb-nsi-ib04 pt2pt]# cat hosts
>> ib03 slots=1
>> ib04 slots=1
>> DNS hosts file (/etc/hosts):
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.71.3 ib03
>> 192.168.71.4 ib04
>> [root@bb-nsi-ib04 pt2pt]#
>> This configuration provides two nodes for the MPI latency evaluation.
>>
>> Benchmark:
>> osu-micro-benchmarks-4.3
>>
>> Result:
>> a. Traffic directed over the 10G TCP/IP port using the following /etc/hosts 
>> file
>>
>>
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.71.3 ib03
>> 192.168.71.4 ib04
>> The average latency from osu_latency is about 4.5 us; see the following log:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                       4.56
>> 1                       4.90
>> 2                       4.90
>> 4                       4.60
>> 8                       4.71
>> 16                      4.72
>> 32                      5.40
>> 64                      4.77
>> 128                     6.74
>> 256                     7.01
>> 512                     7.14
>> 1024                    7.63
>> 2048                    8.22
>> 4096                   10.39
>> 8192                   14.26
>> 16384                  20.80
>> 32768                  31.97
>> 65536                  37.75
>> 131072                 47.28
>> 262144                 80.40
>> 524288                137.65
>> 1048576               250.17
>> 2097152               484.71
>> 4194304               946.01
>>
>> b. Traffic directed over the RoCE link using the following /etc/hosts and 
>> mpirun --mca btl openib,self,sm …
>>
>> [root@bb-nsi-ib04 pt2pt]# cat /etc/hosts
>> 192.168.72.3 ib03
>> 192.168.72.4 ib04
>> Result:
>> [root@bb-nsi-ib04 pt2pt]# mpirun --hostfile hosts -np 2 --mca btl 
>> openib,self,sm --mca btl_openib_cpc_include rdmacm osu_latency
>> # OSU MPI Latency Test v4.3
>> # Size          Latency (us)
>> 0                       4.83
>> 1                       5.17
>> 2                       5.12
>> 4                       5.25
>> 8                       5.38
>> 16                      5.40
>> 32                      5.19
>> 64                      5.04
>> 128                     6.74
>> 256                     7.04
>> 512                     7.34
>> 1024                    7.91
>> 2048                    8.17
>> 4096                   10.39
>> 8192                   14.22
>> 16384                  22.05
>> 32768                  31.68
>> 65536                  37.57
>> 131072                 48.25
>> 262144                 79.98
>> 524288                137.66
>> 1048576               251.38
>> 2097152               485.66
>> 4194304               947.81
>> [root@bb-nsi-ib04 pt2pt]#
>>
>> Question:
>> 1. Why do they have similar latency (about 5 us), which seems too low to 
>> believe? In our test environment it takes more than 50 us to handle a TCP 
>> SYN and return the SYN-ACK, and an x86 server takes more than 20 us on 
>> average to do IP forwarding (measured with a professional hardware tester), 
>> so is this latency reasonable?
>>
>> 2. Normally the switch introduces more than 1.5 us of switching time. Using 
>> Accelio, an open-source RDMA library released by Mellanox, a simple 
>> ping-pong test takes at least 4 us of round-trip latency. So the 5 us MPI 
>> latency above (for both TCP/IP and RoCE) is rather hard to believe…
>>
>> 3. The fact that the TCP/IP transport and the RoCE RDMA transport show the 
>> same latency is puzzling.
>>
>>
>>
>> Before digging deeply into what happens inside the MPI benchmark, can you 
>> offer some suggestions? Is the mpirun command being used correctly here? 
>> There must be some mistake in this test; please correct me.
>>
>> E.g., TCP SYN / SYN-ACK latency (see the attached image001.png).
>>
>> Thanks
>> -Yanfei
>>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>

