weijianwen commented on issue #5826: Does MXNet support RDMA over Converged 
Ethernet (ROCE)
URL: https://github.com/apache/incubator-mxnet/issues/5826#issuecomment-349537388
 
 
   @byronyi sounds like you're replacing the TCP/IP transport that pslite 
(more specifically, ZeroMQ) relies on with RoCE semantics. We made a similar 
observation: pslite is agnostic to the use case (CPUs, GPUs, or multi-node 
reduction), so adapting it to an RDMA-enabled fabric is fairly straightforward. 
However, rewriting only pslite, without reconsidering the data communication 
pattern the way GDR or NCCL do, may miss further optimization opportunities, 
because upper-layer information is not available at that level.
   
   I think there are some questions worth considering when designing an 
RDMA-enabled MXNet, and I hope to get some insights from the MXNet community 
and @byronyi.
   
   1. Should MXNet be agnostic to the network fabric (Ethernet, RoCE, 
InfiniBand), or should we build variants tailored to specific fabrics?
   2. Which approach is favored: 1) add an RDMA plugin to ZeroMQ, as 
@byronyi does for gRPC and TensorFlow; or 2) simply replace ZeroMQ with RoCE 
message-passing semantics?
   3. Which parts of MXNet need to be redesigned when porting to RDMA-enabled 
networks? (Personally I think `Van` in pslite is heavily influenced by 
ZeroMQ's APIs, and it sometimes looks odd to me.)
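   
   To make the fabric-agnostic option in question 1 more concrete, here is a 
minimal sketch of a transport interface with pluggable backends. Everything in 
it is hypothetical and for illustration only: `Transport`, `LoopbackTransport`, 
and `make_transport` are made-up names, not the actual `Van` API in pslite, and 
the in-memory loopback backend only stands in for a real ZeroMQ or RDMA one.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, Dict, Tuple


class Transport(ABC):
    """Hypothetical fabric-agnostic interface (NOT pslite's real Van API)."""

    @abstractmethod
    def send(self, peer: int, payload: bytes) -> int:
        """Send payload to the node with id `peer`; return bytes sent."""

    @abstractmethod
    def recv(self) -> Tuple[int, bytes]:
        """Return (sender id, payload) for the oldest pending message."""


class LoopbackTransport(Transport):
    """In-memory stand-in for a real backend (e.g. ZeroMQ/TCP or RDMA)."""

    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self._inbox: Deque[Tuple[int, bytes]] = deque()
        self._peers: Dict[int, "LoopbackTransport"] = {node_id: self}

    def connect(self, other: "LoopbackTransport") -> None:
        # Register each endpoint with the other so both can send.
        self._peers[other.node_id] = other
        other._peers[self.node_id] = self

    def send(self, peer: int, payload: bytes) -> int:
        # Deliver by appending to the receiver's inbox.
        self._peers[peer]._inbox.append((self.node_id, payload))
        return len(payload)

    def recv(self) -> Tuple[int, bytes]:
        return self._inbox.popleft()


# Hypothetical registry: a fabric-specific backend would be added here,
# so upper layers never name a concrete fabric.
_BACKENDS = {"loopback": LoopbackTransport}


def make_transport(fabric: str, node_id: int) -> Transport:
    """Pick a backend by fabric name; fail loudly for unknown fabrics."""
    if fabric not in _BACKENDS:
        raise ValueError(f"no transport registered for fabric {fabric!r}")
    return _BACKENDS[fabric](node_id)
```

   With this shape, the upper layers only call `send` and `recv`; the 
trade-off raised above still applies, since an interface this narrow hides 
exactly the upper-layer information that a fabric-specific design could exploit.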
