solin319 opened a new issue #8097: speed problem in distribute training
URL: https://github.com/apache/incubator-mxnet/issues/8097

## 1. In file "kvstore_dist_server.h"
Change
```
// TODO(mli) try to remove this CopyFrom
response.vals.CopyFrom(static_cast<const float*>(stored.data().dptr_), len);
```
to
```
response.vals = ps::SArray<float>(stored.data().dptr<float>(), len, false);
```
In a VGG16 training run on two distributed machines (8 GPUs in total), this change speeds training up by **20** samples/sec. Is this method correct?

## 2. In file "kvstore_dist.h"
Deleting line 275, "send_buf.WaitToWrite();", speeds up training with kvstore='sync_device' or 'local'. In the profiler, I can see that this WaitToWrite blocks every push until the whole backward pass has finished. Is this WaitToWrite necessary?
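To make the trade-off in question 1 concrete: as far as I understand, `CopyFrom` gives the response its own snapshot of the values, while building an `SArray` over the stored pointer with the last argument set to `false` only shares the memory. The sketch below shows that aliasing difference with plain `std::vector` and a hypothetical `FloatView` struct; none of these names are ps-lite or MXNet types. Whether the difference matters in the server depends on whether the stored array can be updated before the response is actually sent, which this sketch does not establish.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a zero-copy response buffer (NOT ps::SArray).
struct FloatView {
  const float* data;  // non-owning pointer into someone else's storage
  std::size_t size;

  float at(std::size_t i) const {
    assert(i < size);  // guard against reading past the shared buffer
    return data[i];
  }
};

// Copying variant: the response owns a snapshot of the stored values.
std::vector<float> MakeCopiedResponse(const std::vector<float>& stored) {
  return std::vector<float>(stored.begin(), stored.end());
}

// Zero-copy variant: the response only points at the stored values.
FloatView MakeSharedResponse(const std::vector<float>& stored) {
  return FloatView{stored.data(), stored.size()};
}

// Returns true when a write to `stored` made after both responses were
// built is visible through the shared view but not through the copy.
bool ViewSeesLaterUpdate() {
  std::vector<float> stored = {1.0f, 2.0f, 3.0f};
  std::vector<float> copied = MakeCopiedResponse(stored);
  FloatView shared = MakeSharedResponse(stored);

  stored[0] = 42.0f;  // simulates a later push updating the stored weights

  return copied[0] == 1.0f && shared.at(0) == 42.0f;
}
```

Calling `ViewSeesLaterUpdate()` from a small `main` is enough to try it. The copy costs one pass over the buffer per response but is immune to later writes; the shared view avoids that pass, so it is only safe if nothing modifies or frees the stored buffer while the response is still in flight.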
