We have some new Power8 nodes with dual-port FDR HCAs. I have not tested
same-node Verbs throughput. Using Linux's Cross Memory Attach (CMA), I can get
30 GB/s for 2 MB messages between two cores, after which it drops off to
~12 GB/s. The PCIe Gen3 x16 slots should max out at ~15 GB/s. I agree that
once more than two processes are communicating, aggregate shared-memory
throughput will keep climbing while the PCIe link stays capped at ~15 GB/s.
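
For anyone who wants to poke at the CMA path in isolation, here is a minimal,
self-contained sketch of the primitive involved: process_vm_readv() copies a
buffer straight out of another process's address space in one syscall, with no
shared bounce buffer and no second memcpy. For simplicity it reads from its
own pid; in a real ping-pong test the pid and remote address would come from
the peer over some out-of-band channel.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Copy 'len' bytes from 'remote_addr' in process 'pid' into 'dst'. */
    static ssize_t cma_read(pid_t pid, void *dst, void *remote_addr, size_t len)
    {
        struct iovec local  = { .iov_base = dst,         .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }

    int main(void)
    {
        char src[] = "copied via CMA";
        char dst[sizeof(src)];
        if (cma_read(getpid(), dst, src, sizeof(src)) < 0) {
            perror("process_vm_readv");
            return 1;
        }
        printf("%s\n", dst);
        return 0;
    }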

Scott

On Mar 11, 2015, at 1:41 PM, Howard Pritchard <[email protected]> wrote:

> My experience with DMA engines located on the other side of a PCIe Gen3 x16
> bus from the CPUs is that for a couple of ranks doing large transfers between
> each other on a node, using the DMA engine looks good. But once there are
> multiple ranks exchanging data (say, up to 32 ranks on a dual-socket Haswell
> node, not using HT), using the NIC's DMA engine is not such a good idea.
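> 
> As a rough back-of-the-envelope illustration (assuming the ~15 GB/s cap of a
> PCIe Gen3 x16 link): 32 ranks forming 16 on-node pairs that communicate
> through shared memory each get their own memcpy, so aggregate throughput is
> bounded only by the socket's memory bandwidth, whereas the same 16 pairs
> funneled through the NIC's DMA engine share that single ~15 GB/s link, i.e.
> under 1 GB/s per pair.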
> 
> Howard
> 
> 
> 2015-03-11 10:57 GMT-06:00 Nathan Hjelm <[email protected]>:
> 
> Definitely a side effect, though it could be beneficial in some cases, as
> the RDMA engine in the HCA may be faster than a memcpy above a certain
> message size. I don't know how best to fix this, as I need all RDMA-capable
> BTLs to be listed for RMA. I thought about adding another list to track BTLs
> that have both RMA and atomics, but that would increase the memory footprint
> of Open MPI by a factor of nranks.
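> 
> To make the trade-off concrete, here is a rough, self-contained sketch (not
> Open MPI code; the type and flag names are made up) of the alternative: keep
> the single RDMA list and filter on a capability bit at lookup time, which
> costs a small O(n) scan per operation instead of a second per-endpoint list
> for every rank.
> 
>     #include <stddef.h>
> 
>     #define BTL_FLAG_RDMA    0x1   /* supports put/get        */
>     #define BTL_FLAG_ATOMICS 0x2   /* supports remote atomics */
> 
>     typedef struct {
>         unsigned flags;
>         /* ... module and endpoint pointers ... */
>     } btl_desc_t;
> 
>     /* Return the first BTL usable for both RMA and atomics, or NULL. */
>     static btl_desc_t *find_rma_atomic_btl(btl_desc_t **btls, size_t count)
>     {
>         const unsigned need = BTL_FLAG_RDMA | BTL_FLAG_ATOMICS;
>         for (size_t i = 0; i < count; ++i) {
>             if ((btls[i]->flags & need) == need) {
>                 return btls[i];
>             }
>         }
>         return NULL;
>     }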
> 
> -Nathan
> 
> On Thu, Feb 26, 2015 at 11:59:41PM +0000, Rolf vandeVaart wrote:
> >    This message is mostly for Nathan, but I figured I would go with the
> >    wider distribution. I have noticed some different behaviour that I
> >    assume started with this change:
> >
> >    
> > https://github.com/open-mpi/ompi/commit/4bf7a207e90997e75ba1c60d9d191d9d96402d04
> >
> >    I am noticing that the openib BTL will also be used for on-node
> >    communication even though the sm (or smcuda) BTL is available. I think
> >    that with the aforementioned change the openib BTL is now listed as an
> >    available BTL that supports RDMA. Stepping through in the debugger and
> >    inspecting the bml_endpoint, it appears that sm is listed as the eager
> >    and send BTL, but openib is listed as the RDMA BTL. From the logic in
> >    pml_ob1_sendreq.h, it looks like we can end up selecting the openib BTL
> >    for some of the communication. I ran with increased verbosity and saw
> >    that this was indeed happening. With v1.8, we only appear to use the sm
> >    (or smcuda) BTL.
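> >
> >    To illustrate the decision I am describing, here is a simplified sketch
> >    (not the actual ob1 code; the names are made up): the endpoint carries
> >    separate eager/send and RDMA BTL lists, and once a message is large
> >    enough for the RDMA protocol the BTL is drawn from the RDMA list, so a
> >    long on-node transfer can land on openib even though sm handles the
> >    eager path.
> >
> >        #include <stddef.h>
> >
> >        typedef struct {
> >            void  *send_btl;    /* sm / smcuda                     */
> >            void  *rdma_btl;    /* openib, after the change above  */
> >            size_t eager_limit; /* cutoff for the eager/send path  */
> >        } endpoint_sketch_t;
> >
> >        static void *pick_btl(endpoint_sketch_t *ep, size_t msg_size)
> >        {
> >            if (msg_size <= ep->eager_limit || NULL == ep->rdma_btl) {
> >                return ep->send_btl;  /* short: copy through sm      */
> >            }
> >            return ep->rdma_btl;      /* long: RDMA protocol, openib */
> >        }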
> >
> >    I am wondering if this was intentional with this change or maybe a side
> >    effect.
> >
> >    Rolf
> >
> 
