Thanks Jeff!

It's very helpful. I will read all the responses in this thread again to 
understand your opinions more deeply.  

Thanks 
Yanfei

-----Original Message-----
From: devel [mailto:devel-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres)
Sent: March 28, 2014 19:18
To: Open MPI Developers
Subject: Re: [OMPI devel] Re: Re: Re: doubt on latency result with OpenMPI library

On Mar 27, 2014, at 11:45 PM, "Wang,Yanfei(SYS)" <wangyanfe...@baidu.com> wrote:

> 1. In the RoCE, we cannot use OOB(via tcp socket) for RDMA connection.  

More specifically, RoCE QPs can only be made using the RDMA connection manager.

> However, as I know, Mellanox HCAs supporting RoCE can make RDMA and 
> TCP/IP work simultaneously. Is it that some other HCAs can only work on 
> RoCE or normal Ethernet individually,

FYI: Mellanox is the only RoCE vendor.

> so that OMPI cannot use OOB (like a TCP socket) to build an RDMA connection except 
> via RDMA_CM?   

You're mixing two different things: having the ability to run an OS IP stack 
over a RoCE-capable NIC is orthogonal to whether you can use some out-of-band 
method to make RoCE RC QPs.

I think you're misunderstanding what OMPI's "oob" QP connection mechanism did.  
Here's what it did:

1. MPI processes A and B (on different servers) would each create half a QP.
2. They would then extract the QP connection information from the half-created 
   QP data structures (e.g., the unique QP number) -- A would extract Aa and B 
   would extract Bb.
3. A and B would exchange this information.
4. A would use Bb to finish creating its QP, and B would use Aa to finish 
   creating its QP.  This is a LOCAL operation -- it's effectively just filling 
   in some data structures.
5. Now A and B have fully formed QPs and can use them to send/receive to each 
   other.
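To make steps 1, 2, and 4 concrete, here is a rough verbs sketch of "half 
creating" an RC QP and later "filling in" the peer's information.  This is not 
OMPI's actual openib BTL code; the qp_endpoint struct and the specific 
attribute values are illustrative assumptions only:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Hypothetical container for the info each peer extracts (step 2)
 * and exchanges (step 3): QP number, LID, and a starting PSN.      */
struct qp_endpoint {
    uint32_t qp_num;
    uint16_t lid;
    uint32_t psn;
};

/* Steps 1-2: create the "half" QP and pull out the local endpoint info. */
static struct ibv_qp *create_half_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                                     struct ibv_context *ctx,
                                     struct qp_endpoint *local)
{
    struct ibv_qp_init_attr init = {
        .send_cq = cq, .recv_cq = cq,
        .qp_type = IBV_QPT_RC,                 /* reliable connected QP    */
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &init);
    struct ibv_port_attr port;
    ibv_query_port(ctx, 1, &port);

    local->qp_num = qp->qp_num;                /* the unique QP number     */
    local->lid    = port.lid;
    local->psn    = 0;                         /* illustrative fixed PSN   */
    return qp;
}

/* Step 4: a purely LOCAL operation -- use the peer's info (received in
 * step 3 over *any* channel) to move the QP through INIT -> RTR -> RTS. */
static int finish_qp(struct ibv_qp *qp, const struct qp_endpoint *remote,
                     const struct qp_endpoint *local)
{
    struct ibv_qp_attr attr = { .qp_state = IBV_QPS_INIT, .pkey_index = 0,
                                .port_num = 1,
                                .qp_access_flags = IBV_ACCESS_REMOTE_WRITE };
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                      IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    attr = (struct ibv_qp_attr){ .qp_state = IBV_QPS_RTR,
        .path_mtu = IBV_MTU_1024,
        .dest_qp_num = remote->qp_num,         /* peer's QP number (Bb/Aa) */
        .rq_psn = remote->psn,
        .max_dest_rd_atomic = 1, .min_rnr_timer = 12,
        .ah_attr = { .dlid = remote->lid, .port_num = 1 } };
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    attr = (struct ibv_qp_attr){ .qp_state = IBV_QPS_RTS,
        .timeout = 14, .retry_cnt = 7, .rnr_retry = 7,
        .sq_psn = local->psn, .max_rd_atomic = 1 };
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
                         IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                         IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}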

The fact that #3 used TCP sockets to exchange information is irrelevant -- you 
could very well have printed that information out on a screen and hand-typed it 
in at the peer.

The only important aspect is that the information had to be exchanged.  It 
doesn't matter whether you use TCP sockets or the actual RDMA CM.
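For example, step #3 could be nothing more than a write()/read() of that little 
blob over any already-connected TCP socket.  A minimal sketch, reusing the 
hypothetical qp_endpoint struct from the sketch above (partial reads/writes and 
error handling are ignored):

#include <unistd.h>

/* Exchange endpoint info over an existing TCP socket; the transport
 * carrying these bytes is irrelevant to the QPs themselves.          */
static int exchange_endpoints(int sockfd, const struct qp_endpoint *mine,
                              struct qp_endpoint *theirs)
{
    if (write(sockfd, mine, sizeof(*mine)) != sizeof(*mine))
        return -1;
    if (read(sockfd, theirs, sizeof(*theirs)) != sizeof(*theirs))
        return -1;
    return 0;   /* now call finish_qp(qp, theirs, mine) on each side */
}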

*** Also keep in mind that OMPI's "oob" connection method for IB RC QPs in the 
openib BTL has been deleted; it has been wholly replaced with the "udcm" 
connection method, which uses UD QPs (which act very much like UDP datagrams) 
for step #3.
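For reference, a UD QP of the kind udcm uses can be brought all the way to a 
usable state with purely local information -- no connection manager involved.  
A rough sketch using the same verbs headers as above (the qkey value is just an 
illustrative assumption):

/* Create a UD (datagram, connectionless) QP and move it to RTS locally. */
static struct ibv_qp *create_ud_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr init = {
        .send_cq = cq, .recv_cq = cq,
        .qp_type = IBV_QPT_UD,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &init);

    struct ibv_qp_attr attr = { .qp_state = IBV_QPS_INIT, .pkey_index = 0,
                                .port_num = 1, .qkey = 0x11111111 };
    ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                  IBV_QP_PORT | IBV_QP_QKEY);

    attr = (struct ibv_qp_attr){ .qp_state = IBV_QPS_RTR };
    ibv_modify_qp(qp, &attr, IBV_QP_STATE);

    attr = (struct ibv_qp_attr){ .qp_state = IBV_QPS_RTS, .sq_psn = 0 };
    ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_SQ_PSN);
    return qp;
}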

For IB, this method of "exchange critical connection information via an 
out-of-band method" works fine.  For RoCE, it's not possible -- there's 
additional, kernel-level (and possibly hardware-level? I don't know/remember 
offhand) information that cannot be extracted by userspace and exchanged via an 
out-of-band method.  Hence, you HAVE to use the RDMA CM to make RoCE QPs.

Let me make this totally clear: the fact that you have to use the RDMA CM to 
make RoCE RC QPs is not an OMPI choice.  It's mandated by how the RoCE 
technology works.  IB technology allows the "workaround" of extracting the 
necessary connection information such that we can use our "udcm" and not RDMA 
CM.
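For comparison, here is roughly what the mandated RDMA CM path looks like on 
the active (connecting) side.  Again, this is a simplified sketch with error 
handling omitted, not OMPI's actual code; "dst" is the peer's IP address/port:

#include <rdma/rdma_cma.h>

static struct rdma_cm_id *connect_via_rdma_cm(struct sockaddr *dst,
                                              struct ibv_qp_init_attr *qp_attr)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id;
    struct rdma_cm_event *ev;

    rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);

    rdma_resolve_addr(id, NULL, dst, 2000);    /* find the local device    */
    rdma_get_cm_event(ch, &ev);                /* wait for ADDR_RESOLVED   */
    rdma_ack_cm_event(ev);

    rdma_resolve_route(id, 2000);              /* find the path to the peer */
    rdma_get_cm_event(ch, &ev);                /* wait for ROUTE_RESOLVED  */
    rdma_ack_cm_event(ev);

    struct ibv_pd *pd = ibv_alloc_pd(id->verbs);
    rdma_create_qp(id, pd, qp_attr);           /* RC QP bound to this cm_id */

    struct rdma_conn_param param = { .retry_count = 7 };
    rdma_connect(id, &param);                  /* kernel CM does the
                                                  connection handshake      */
    rdma_get_cm_event(ch, &ev);                /* wait for ESTABLISHED     */
    rdma_ack_cm_event(ev);
    return id;
}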

> I think, if OOB (like TCP) can run simultaneously with RoCE, the RDMA 
> connection management would benefit from the TCP socket's scalability, right?  
> 
> 2. Scalability of RDMA_CM.  
> Previously I also had a few doubts about RDMA_CM's scalability when I looked 
> deeply into the source code of the RDMA_CM library and the corresponding kernel 
> module, e.g., the single shared QP1 for connection requests and responses, which 
> could introduce severe lock contention if a huge number of RDMA connections 
> exist, plus remote NUMA memory accesses on multi-core platforms; also lots of 
> shared session management data structures which could cause additional contention. 
> However, if the connections are not frequently destroyed and rebuilt, does 
> scalability still depend heavily on RDMA_CM?   
> To become more familiar with UDCM, I would like a deeper understanding of 
> RDMA_CM's disadvantages.  

You'll have to ask Mellanox / the OpenFabrics community for insights about the 
RDMA CM.  To OMPI, that's the lower layer and we're just a consumer of it.

Keep in mind that the CM is only used during QP connection establishment -- 
it's not used after that.  So if it's a little less efficient, it usually 
doesn't matter (if it's a LOT less efficient, then it does matter, of course).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/03/14418.php
