chandana1332 opened a new issue #17237: Data imbalance handling in MXNet Gluon
URL: https://github.com/apache/incubator-mxnet/issues/17237
 
 
   Hello,
   
   My question regarding data imbalance handling in Gluon is as follows:
   
   Suppose I'm training with 4 GPUs. For each update, my training loop samples 4 
batches (one per GPU) and runs the forward/backward pass on them. Using a Gluon 
Trainer, I can then reduce the gradients across all 4 GPUs and apply the update. 
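   
   Roughly, the loop looks like this (a minimal sketch; the network, loss, 
learning rate, and `train_iter` are placeholders, not my actual model):
   
```python
import mxnet as mx
from mxnet import autograd, gluon

ctxs = [mx.gpu(i) for i in range(4)]

net = gluon.nn.Dense(10)                       # placeholder network
net.initialize(ctx=ctxs)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

for data, label in train_iter:                 # placeholder iterator
    # One slice per GPU.
    data_parts = gluon.utils.split_and_load(data, ctxs)
    label_parts = gluon.utils.split_and_load(label, ctxs)
    with autograd.record():
        losses = [loss_fn(net(x), y)
                  for x, y in zip(data_parts, label_parts)]
    for l in losses:
        l.backward()
    # Reduces gradients across all 4 GPUs and updates the parameters.
    trainer.step(data.shape[0])
```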
   
   Now I'm towards the end of an epoch and only have 2 batches left to 
process. I sample those 2 batches, send them to the first two GPUs, and run the 
forward/backward pass. At this point, only 2 GPUs have non-zero gradients. If I 
call Trainer.step(), how does it reduce gradients across all GPUs? (A sketch of 
this last iteration follows the questions below.)
   
   1. Do the GPUs that didn't process a batch contribute zero gradients during 
the reduce operation? That is, do all GPUs participate in the reduction?
   2. Or do only the GPUs that have non-zero gradients send their gradients to a 
server for reduction, after which the reduced gradient is broadcast to all GPUs?
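   
   For concreteness, here is a sketch of that last partial iteration, reusing 
the hypothetical `net`, `loss_fn`, `trainer`, and `ctxs` from the loop above 
(`data`/`label` stand for the 2 remaining batches). The explicit gradient 
zeroing is only my guess at a workaround, so that idle GPUs would contribute 
exact zeros rather than stale gradients; I don't know whether Gluon requires it:
   
```python
# Zero every gradient buffer on every GPU first. With the default
# grad_req='write', the 2 idle GPUs would otherwise still hold the
# gradients from the previous iteration.
for param in net.collect_params().values():
    for grad in param.list_grad():   # one gradient NDArray per context
        grad[:] = 0

active_ctxs = ctxs[:2]               # only 2 batches left -> 2 GPUs used
data_parts = gluon.utils.split_and_load(data, active_ctxs)
label_parts = gluon.utils.split_and_load(label, active_ctxs)
with autograd.record():
    losses = [loss_fn(net(x), y)
              for x, y in zip(data_parts, label_parts)]
for l in losses:
    l.backward()
# Normalize by the number of samples actually processed in this step.
trainer.step(data.shape[0])
```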
   
   
