Hi,

1. With RoCE, we cannot use OOB (via a TCP socket) to set up the RDMA connection. However, as far as I know, Mellanox HCAs that support RoCE can run RDMA and TCP/IP simultaneously. Is it that some other HCAs can only operate in RoCE mode or in plain Ethernet mode, but not both at once, so that OMPI cannot use OOB (e.g., a TCP socket) to build the RDMA connection and has no option other than RDMA_CM? I think that if OOB (e.g., TCP) can run alongside RoCE, RDMA connection management would benefit from the TCP socket's scalability, right?
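To make concrete what I mean by OOB here, below is a minimal sketch of swapping RC QP attributes over an ordinary TCP socket. The struct layout and helper name are my own illustration, not OMPI's openib oob wire format, and a real implementation would serialize each field (htonl/ntohl and friends) instead of sending a raw struct, since padding and endianness differ across hosts.

/* Sketch of the OOB idea: each side learns the peer's RC QP attributes
 * over a plain TCP socket and only afterwards moves its QP to RTR/RTS
 * with ibv_modify_qp().  Illustrative layout only (see caveats above). */
#include <stdint.h>
#include <unistd.h>
#include <infiniband/verbs.h>

struct oob_qp_info {
    uint32_t qp_num;    /* from ibv_create_qp(): local RC QP number      */
    uint32_t psn;       /* initial packet sequence number we will use    */
    uint16_t lid;       /* port LID (0 for RoCE, which is GID-addressed) */
    union ibv_gid gid;  /* GID, required for RoCE address vectors        */
};

/* Swap QP info with the peer over an already-connected TCP socket.
 * Returns 0 on success, -1 on a short read/write. */
static int oob_exchange(int sockfd,
                        const struct oob_qp_info *local,
                        struct oob_qp_info *remote)
{
    if (write(sockfd, local, sizeof(*local)) != (ssize_t)sizeof(*local))
        return -1;

    size_t got = 0;
    while (got < sizeof(*remote)) {
        ssize_t n = read(sockfd, (char *)remote + got, sizeof(*remote) - got);
        if (n <= 0)
            return -1;
        got += (size_t)n;
    }
    return 0;
}

Each side then copies the peer's qp_num, psn and gid into the ibv_qp_attr it passes to ibv_modify_qp() to bring its RC QP through INIT -> RTR -> RTS, which is exactly the hand-off RDMA_CM would otherwise perform.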
2. Scalability of RDMA_CM. I have also had a few doubts about RDMA_CM's scalability since digging into the source code of the RDMA_CM library and the corresponding kernel module: for example, the single shared QP1 used for connection requests and responses, which could introduce severe lock contention (and remote NUMA memory accesses on multi-core platforms) when a huge number of RDMA connections exist; and the many shared session-management data structures, which could cause additional contention. However, if connections are not frequently destroyed and rebuilt, does scalability still depend heavily on RDMA_CM? To better appreciate UDCM, I would like to gain a deeper understanding of RDMA_CM's disadvantages.

This thread has been a great help to me on OMPI and RDMA transport settings, thanks!

Thanks,
Yanfei

-----Original Message-----
From: devel [mailto:devel-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres)
Sent: March 28, 2014 0:58
To: Open MPI Developers
Subject: Re: [OMPI devel] Re: Re: doubt on latency result with OpenMPI library

On Mar 27, 2014, at 11:15 AM, "Wang,Yanfei(SYS)" <wangyanfe...@baidu.com> wrote:

> Normally we use rdma-cm to build the rdma connection, then create QPs to do
> rdma data transmission, so what is the consideration for separating rdma-cm
> connection setup and data transmission at the design stage?

There's some history here...

Waaaay back in the day, the only way to make RC verbs connections over IB was to send QP numbers (and other info) out-of-band to a peer (e.g., via TCP sockets). OMPI implemented this method in the openib BTL.

This had some scalability issues, though, so we eventually started experimenting with some other mechanisms for making RC QPs. For example, we tried using the IB connection manager (IBCM) for a while, but that ultimately got dropped.

The RDMA Connection Manager (RDMA CM) was always an option, but we never bothered to implement it in OMPI until other technologies came along that *required* the use of the RDMA CM, namely: iWARP and RoCE. Meaning: you *can't* make RC QPs over iWARP or RoCE with the OOB method, nor can you use the IB CM -- you *have* to use the RDMA CM.

RDMA CM has its own limitations, though. So for IB RC QPs -- where you don't *have* to use the RDMA CM -- we recently implemented the UDCM, which basically does the same thing as the initial OOB method, but in a more scalable and efficient fashion (I'm leaving out the details; let me know if you want to hear them).

So at different times, we've had different numbers of mechanisms in OMPI for making these connections. In the v1.7/v1.8 tree, I believe that the only 2 left are the RDMA CM and the UDCM. I also believe that for iWARP and RoCE, the RDMA CM will be chosen automatically, and UDCM will be automatically chosen for IB.

So after all that: I think you shouldn't need to specify the connection manager MCA parameter at all; the openib BTL should choose the Right one for you.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
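P.S. For context on question 2, this is roughly the client-side RDMA_CM sequence I have in mind when comparing it with the OOB exchange above: a condensed librdmacm sketch of my own (error handling collapsed, no data transfer, not OMPI's rdmacm CPC). On IB and RoCE the connect request behind rdma_connect() is carried by CM MADs over the shared GSI QP1, which is where my contention concern comes from.

/* Minimal client-side RDMA_CM connect sequence (librdmacm), shown only to
 * contrast with the TCP-based QP exchange above.  Sketch only. */
#include <rdma/rdma_cma.h>
#include <stdio.h>
#include <stdlib.h>

static struct rdma_cm_event *wait_event(struct rdma_event_channel *ch,
                                        enum rdma_cm_event_type expected)
{
    struct rdma_cm_event *ev;
    if (rdma_get_cm_event(ch, &ev) || ev->event != expected) {
        fprintf(stderr, "unexpected CM event\n");
        exit(EXIT_FAILURE);
    }
    return ev;   /* caller must rdma_ack_cm_event() */
}

int main(int argc, char **argv)
{
    struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP }, *res;
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id;

    if (argc != 3 || rdma_getaddrinfo(argv[1], argv[2], &hints, &res))
        exit(EXIT_FAILURE);
    rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);

    /* 1. Resolve the destination IP to an RDMA device/GID (IB or RoCE). */
    rdma_resolve_addr(id, NULL, res->ai_dst_addr, 2000 /* ms */);
    rdma_ack_cm_event(wait_event(ch, RDMA_CM_EVENT_ADDR_RESOLVED));

    /* 2. Resolve the route (path record on IB, next hop on RoCE). */
    rdma_resolve_route(id, 2000 /* ms */);
    rdma_ack_cm_event(wait_event(ch, RDMA_CM_EVENT_ROUTE_RESOLVED));

    /* 3. Create the RC QP on the resolved device and send the connect
     *    request; the CM, not a TCP socket, carries the QP numbers. */
    struct ibv_qp_init_attr qp_attr = {
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    rdma_create_qp(id, NULL /* default PD */, &qp_attr);

    struct rdma_conn_param param = { .retry_count = 7, .rnr_retry_count = 7 };
    rdma_connect(id, &param);
    rdma_ack_cm_event(wait_event(ch, RDMA_CM_EVENT_ESTABLISHED));
    printf("connected; QP %u ready\n", id->qp->qp_num);

    rdma_disconnect(id);
    rdma_destroy_qp(id);
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ch);
    rdma_freeaddrinfo(res);
    return 0;
}

(Compile with -lrdmacm -libverbs and point it at a host running an rdma_cm listener; the 2000 ms timeouts and QP depths are arbitrary placeholder values.)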