chandana1332 opened a new issue #17237: Data imbalance handling in MXNet Gluon
URL: https://github.com/apache/incubator-mxnet/issues/17237

Hello,

My question is about data imbalance handling in Gluon:

Suppose I'm training with 4 GPUs. For each update, my training loop samples 4 batches (one per GPU) and runs forward/backward on them. Using a Gluon Trainer, I can then reduce the gradients and update the parameters across all 4 GPUs.

Now, towards the end of an epoch, only 2 batches remain. I sample those 2 batches, send them to the first two GPUs, and run forward/backward. At this point, only 2 GPUs have fresh gradients. If I call Trainer.step(), how are gradients reduced across all GPUs?

1. Do the GPUs that didn't process a batch contribute zero gradients during the reduce operation, so that all GPUs still participate in the reduction?
2. Or do only the GPUs with non-zero gradients send their gradients to a server for reduction, with the reduced gradient then broadcast back to all GPUs?
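To make the question concrete, here is a minimal NumPy sketch (not MXNet API; all names are illustrative) simulating scenario 1: the idle devices' gradient buffers are explicitly zeroed before the reduce, so reducing over all 4 devices gives the same result as reducing over only the 2 active ones. Note the assumption that the idle buffers still hold stale gradients from the previous step and must be cleared, e.g. the way one would zero parameter gradients in a real framework.

```python
import numpy as np

# Simulate one parameter's per-device gradient buffers on 4 devices.
rng = np.random.default_rng(0)
num_devices = 4
param_shape = (3,)

# Stale gradients left over from the previous (full) update.
grads = [rng.normal(size=param_shape) for _ in range(num_devices)]

# End of epoch: only 2 batches remain, so only devices 0 and 1
# run forward/backward and get fresh gradients.
fresh = [rng.normal(size=param_shape) for _ in range(2)]
for i in range(2):
    grads[i] = fresh[i]

# Scenario 1: idle devices contribute zeros. For that to hold, their
# stale buffers must be zeroed explicitly before the reduce.
for i in range(2, num_devices):
    grads[i] = np.zeros(param_shape)

reduced_all = sum(grads)     # reduce over all 4 devices (zeros included)
reduced_active = sum(fresh)  # reduce over only the 2 active devices

# Both reduction strategies yield the same summed gradient.
assert np.allclose(reduced_all, reduced_active)
print(reduced_all)
```

The sketch shows why the two scenarios are numerically equivalent for a sum-reduce, provided the idle buffers are zeroed; if they are not zeroed, scenario 1 would silently mix in stale gradients.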
