weijianwen commented on issue #5826: Does MXNet support RDMA over Converged Ethernet (ROCE)
URL: https://github.com/apache/incubator-mxnet/issues/5826#issuecomment-349537388

@byronyi It sounds like you're replacing the TCP/IP semantics that pslite (or, more specifically, ZeroMQ) relies on with RoCE semantics. We made a similar observation: pslite is agnostic to the use case (CPUs, GPUs, or multi-node reduction), so adapting it to an RDMA-enabled fabric is fairly straightforward. However, rewriting pslite alone, without reconsidering the data communication pattern (like that in GDR or NCCL), may miss opportunities for further optimization, because upper-layer information is absent.

I think the following questions are worth considering when designing an RDMA-enabled MXNet, and I hope to get some insights from the MXNet community and from byronyi.

1. Should we build MXNet to be agnostic to the network fabric (Ethernet, RoCE, InfiniBand), or build versions tailored to specific fabrics?
2. Which approach is preferable: (1) adding an RDMA plugin to ZeroMQ, as @byronyi did for gRPC and TensorFlow, or (2) simply replacing ZeroMQ with RoCE message-passing semantics?
3. Which parts of MXNet need to be redesigned when porting to RDMA-enabled networks? (Personally, I think `Van` in pslite is heavily influenced by ZeroMQ's APIs, and it sometimes looks odd to me.)
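
To make the fabric-agnostic option in questions 1 and 2 a bit more concrete, here is a minimal Python sketch of a pluggable transport layer. Everything in it is hypothetical and for illustration only: `Transport`, `LoopbackTransport`, `BACKENDS`, and `make_transport` are invented names, not pslite's actual `Van` API, and the in-memory loopback backend merely stands in for a real ZeroMQ/TCP or RDMA implementation.

```python
from abc import ABC, abstractmethod
from collections import deque


class Transport(ABC):
    """Hypothetical fabric-agnostic interface (NOT pslite's real Van API)."""

    @abstractmethod
    def send(self, dest: int, payload: bytes) -> int:
        """Deliver payload to node `dest`; return the number of bytes sent."""

    @abstractmethod
    def recv(self, node: int) -> bytes:
        """Return the oldest pending message addressed to `node`."""


class LoopbackTransport(Transport):
    """In-memory stand-in for a real backend (e.g. ZeroMQ/TCP or RDMA)."""

    def __init__(self) -> None:
        # One FIFO queue of pending messages per destination node id.
        self._queues = {}

    def send(self, dest: int, payload: bytes) -> int:
        self._queues.setdefault(dest, deque()).append(payload)
        return len(payload)

    def recv(self, node: int) -> bytes:
        return self._queues[node].popleft()


# Assumed registry: a tailored backend would be added here under its own
# name, while upper layers keep talking to the Transport interface only.
BACKENDS = {"loopback": LoopbackTransport}


def make_transport(name: str) -> Transport:
    """Select a backend by name, e.g. from a configuration value."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown transport backend: {name}") from None
```

With a layout like this, the upper layers would only see `send` and `recv`, which keeps them fabric-agnostic. The trade-off raised above still applies: an interface this narrow hides upper-layer information that a fabric-specific design could use for optimization.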
