solin319 opened a new issue #8097: speed problem in distribute training
URL: https://github.com/apache/incubator-mxnet/issues/8097
 
 
   ## 1. In file "kvstore_dist_server.h"
   Change
   ```
   // TODO(mli) try to remove this CopyFrom
   response.vals.CopyFrom(static_cast<const float*>(stored.data().dptr_), len);
   ```
   to
   ```
   response.vals = ps::SArray<float>(stored.data().dptr<float>(), len, false);
   ```
   
   In VGG16 training on two distributed machines (8 GPUs in total), this change increases throughput by **20** samples/sec.
   Is this method correct?
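
To make the question concrete, here is a minimal sketch of the difference between the two versions. It does not use ps-lite: `FloatView` is a hypothetical stand-in for a non-owning array such as the one built by `ps::SArray<float>(ptr, len, false)`, and `MakeCopiedResponse` / `MakeSharedResponse` are illustrative names, not MXNet functions. The sketch only shows the aliasing behavior: a zero-copy response keeps pointing at the stored buffer, so any update to the stored values made before the response is actually sent would be visible in it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a non-owning array: it aliases memory it does
// not own and never frees it.
struct FloatView {
  const float* data;
  std::size_t size;
};

// Copying response: takes a snapshot of the stored values
// (the effect the original CopyFrom call has).
std::vector<float> MakeCopiedResponse(const std::vector<float>& stored) {
  return std::vector<float>(stored.begin(), stored.end());
}

// Zero-copy response: shares the stored buffer, so no copy is made.
FloatView MakeSharedResponse(const std::vector<float>& stored) {
  return FloatView{stored.data(), stored.size()};
}

// Returns true if a later in-place update to the store is visible through
// the shared view but not through the copy.
bool SharedViewSeesLaterUpdate() {
  std::vector<float> stored = {1.0f, 2.0f, 3.0f};
  std::vector<float> copied = MakeCopiedResponse(stored);
  FloatView shared = MakeSharedResponse(stored);
  assert(shared.size == stored.size());
  stored[0] = 42.0f;  // simulates an update arriving before the response is sent
  return copied[0] == 1.0f && shared.data[0] == 42.0f;
}
```

Under these assumptions, `SharedViewSeesLaterUpdate()` returns true. Whether the zero-copy version is safe in the server therefore seems to depend on whether the stored array can be modified or freed before the response has been transmitted.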
   
   ## 2. In file "kvstore_dist.h"
   Deleting line 275, "send_buf.WaitToWrite();", speeds up training with kvstore='sync_device' or 'local'.
   In the profiler, I can see that this WaitToWrite blocks every push until the whole backward pass has finished.
   Is this WaitToWrite necessary?
 
