YouhuiBai opened a new issue #15674: Straggler in latest mxnet when training with distributed parameter server
URL: https://github.com/apache/incubator-mxnet/issues/15674

## Description

Hi, I found a strange straggler in the latest MXNet when training CNNs with the distributed parameter server architecture (BSP model) on GPUs: it is always the worker whose `rank == 0`. I believe this is a bug in MXNet, because I deployed MXNet in a homogeneous environment, meaning every participating machine has identical hardware and software (listed below), and the straggler persisted even when I changed the number of workers or the physical machines (e.g., running on AWS).

## Environment info (Required)

```
system version: CentOS 7.5.1804
kernel version: 3.10.0-862.9.1.el7.x86_64
cuda version: cuda_9.2.148
cudnn version: cudnn-9.2-linux-x64-v7.1
nvidia driver version: 396.37
GPU: GeForce GTX 1080 Ti
NIC: 10GE
```

Software and parameters:

```
parameter server architecture: m servers, n workers, n >= m, each role on a different physical machine
application: image classification
dataset: ImageNet 2012
CNN model: inception-v4, lenet, resnet, etc.
GPU usage: one physical GPU per worker
scaling model: strong scaling
consistency model: BSP
```

## What is a straggler?

When I start training with the environment and parameters above and instrument the critical path, one worker's behavior is strange. Under the BSP consistency model of the parameter server, the server does not respond to push operations on a key until it has received updates for that key from all workers. I found that one worker was consistently slower, so the other workers waited for it on every iteration: that worker is the straggler.

The straggler has other distinguishing features:

1. `rank == 0` (the first worker)
2. higher CPU usage
3. higher CPU memory throughput
4. higher GPU usage
5. more time spent in `cudaMemcpy`
6. higher LLC miss rate
7. lower CPU memory occupancy

This is very, very strange. Thanks a lot.
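To illustrate why a single slow worker gates every iteration under BSP, here is a toy sketch in plain Python (not MXNet code; the per-worker timings are made-up numbers). Since the server only applies an update once all workers have pushed, each iteration's wall time is the maximum over all workers, and every faster worker idles at the barrier for the difference:

```python
# Hypothetical per-iteration compute+push times (milliseconds) for 4 workers.
# Rank 0 plays the straggler role observed in this issue.
iter_times_ms = {
    0: 130,  # straggler (assumed timing, for illustration only)
    1: 100,
    2: 100,
    3: 100,
}

def bsp_iteration_time(times):
    """Wall time of one BSP iteration: the server's barrier waits for the
    slowest worker's push before responding to anyone."""
    return max(times.values())

def barrier_wait_time(times, rank):
    """How long a given worker idles at the barrier each iteration."""
    return bsp_iteration_time(times) - times[rank]

if __name__ == "__main__":
    print("iteration wall time:", bsp_iteration_time(iter_times_ms), "ms")
    for rank in sorted(iter_times_ms):
        print(f"rank {rank} waits {barrier_wait_time(iter_times_ms, rank)} ms")
```

With these numbers, a 30 ms straggler penalty on rank 0 slows the whole job by 30% even though three of the four workers are idle for that time, which is why a consistent rank-0 straggler in a homogeneous cluster is worth investigating.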
