I'm using a distributed kvstore with the Gluon trainer and found the following two 
bugs:

1. Passing `update_on_kvstore=True` to `gluon.Trainer` has no effect: inspecting 
`trainer._update_on_kvstore` afterwards shows the value is still `False`. 

2. When a distributed kvstore is used, `gluon.Trainer` by default doesn't work 
with `mx.optimizer.LRScheduler` if a worker has more than one GPU. Specifically, 
the trainer calls the optimizer once per GPU, while the `LRScheduler` object is 
shared across GPUs, so it sees a wrong (inflated) update count. 
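The miscount in bug 2 can be sketched in plain Python, without mxnet. The `StepScheduler` class and the training loop below are hypothetical stand-ins for `mx.optimizer.LRScheduler` and the per-GPU optimizer calls; the point is that a counter shared across N GPUs advances N times per global step, so the schedule decays the learning rate far too early:

```python
class StepScheduler:
    """Hypothetical stand-in for an mx.optimizer.LRScheduler subclass:
    halves the learning rate every `step` updates."""
    def __init__(self, base_lr=0.1, step=100):
        self.base_lr = base_lr
        self.step = step

    def __call__(self, num_update):
        return self.base_lr * (0.5 ** (num_update // self.step))

scheduler = StepScheduler(base_lr=0.1, step=100)
num_gpus = 4
num_update = 0  # shared counter, bumped on every optimizer update

# 100 "global" training steps; when the trainer updates locally
# (update_on_kvstore is False), the optimizer runs once per GPU,
# so the shared scheduler counter advances num_gpus times per step.
for _ in range(100):
    for _ in range(num_gpus):
        num_update += 1
        lr = scheduler(num_update)

# After 100 global steps the scheduler has seen 400 updates, so the
# learning rate was halved 4 times instead of once.
print(num_update)  # 400, not 100
print(lr)          # ~0.00625; the intended value is 0.05
```

With a single GPU per worker the two counts coincide, which is why the bug only shows up with more than one GPU per worker.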

This means one cannot train ImageNet classification with a ResNet using the Gluon 
trainer. 

[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/12713 ]