YouhuiBai opened a new issue #15674: Straggler in latest mxnet when training with distributed parameter server
URL: https://github.com/apache/incubator-mxnet/issues/15674

## Description

Hi, I found a strange straggler in the latest MXNet when training CNNs with the distributed parameter server architecture (BSP model) on GPUs: it is always the worker whose `rank == 0`. I believe this is a bug in MXNet, because I deployed MXNet in a homogeneous environment, meaning every participating machine has identical hardware and software (listed below), and the straggler persisted even when I changed the number of workers or the physical machines (e.g., running on AWS).

## Environment info (Required)

```
system version: CentOS 7.5.1804
kernel version: 3.10.0-862.9.1.el7.x86_64
cuda version: cuda_9.2.148
cudnn version: cudnn-9.2-linux-x64-v7.1
nvidia driver version: 396.37
GPU: GeForce GTX 1080 Ti
NIC: 10GE
```

Software and parameters:

```
parameter server architecture: m servers, n workers, n >= m, each role on a different physical machine
application: image classification
dataset: ImageNet 2012
CNN model: inception-v4, lenet, resnet, etc.
GPU usage: one physical GPU per worker
scaling model: strong scaling
consistency model: BSP
```

## What is a straggler?

When I start training with the environment and parameters above and instrument the critical path, one worker's behavior is strange. Under the BSP consistency model of the parameter server, the server does not respond to push operations on a key until it has received updates for that key from all workers. I found that one worker was consistently slower, so the other workers waited for it on every iteration: that worker is the straggler.

The straggler has other distinguishing features:

1. `rank == 0` (the first worker)
2. higher CPU usage
3. higher CPU memory throughput
4. higher GPU usage
5. more time spent in `cudaMemcpy`
6. higher LLC miss rate
7. lower CPU memory occupancy

This is very, very strange. Thanks a lot.
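To illustrate why a single slow worker gates every iteration under BSP, here is a toy sketch in plain Python (not MXNet code; the per-worker timings are made-up numbers). Since the server only applies an update once all workers have pushed, each iteration's wall time is the maximum over all workers, and every faster worker idles at the barrier for the difference:

```python
# Hypothetical per-iteration compute+push times (milliseconds) for 4 workers.
# Rank 0 plays the straggler role observed in this issue.
iter_times_ms = {
    0: 130,  # straggler (assumed timing, for illustration only)
    1: 100,
    2: 100,
    3: 100,
}

def bsp_iteration_time(times):
    """Wall time of one BSP iteration: the server's barrier waits for the
    slowest worker's push before responding to anyone."""
    return max(times.values())

def barrier_wait_time(times, rank):
    """How long a given worker idles at the barrier each iteration."""
    return bsp_iteration_time(times) - times[rank]

if __name__ == "__main__":
    print("iteration wall time:", bsp_iteration_time(iter_times_ms), "ms")
    for rank in sorted(iter_times_ms):
        print(f"rank {rank} waits {barrier_wait_time(iter_times_ms, rank)} ms")
```

With these numbers, a 30 ms straggler penalty on rank 0 slows the whole job by 30% even though three of the four workers are idle for that time, which is why a consistent rank-0 straggler in a homogeneous cluster is worth investigating.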
