I'm using a distributed kvstore with the Gluon trainer and found the following two bugs:
1. Initializing `trainer = gluon.Trainer(update_on_kvstore=True)` doesn't take effect. Inspecting `trainer._update_on_kvstore` shows that the value is still set to `False`.
2. When a distributed kvstore is used, `gluon.Trainer` by default doesn't work with `mx.optimizer.LRScheduler` if a worker has more than one GPU. Specifically, the trainer updates once per GPU, but the `LRScheduler` object is shared across GPUs and therefore gets a wrong update count. As a result, one cannot train ImageNet classification with ResNet using the Gluon trainer.

[ Full content available at: https://github.com/apache/incubator-mxnet/issues/12713 ]
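To make the second bug concrete, here is a minimal pure-Python sketch (no MXNet, with a toy step-decay scheduler standing in for `mx.optimizer.LRScheduler`) of how a shared scheduler gets an inflated update count when the trainer steps once per GPU instead of once per batch:

```python
class StepScheduler:
    """Toy LR scheduler: halve the learning rate every `step` updates.
    Stands in for mx.optimizer.LRScheduler in this illustration."""
    def __init__(self, base_lr=0.1, step=100):
        self.base_lr = base_lr
        self.step = step

    def __call__(self, num_update):
        return self.base_lr * (0.5 ** (num_update // self.step))

scheduler = StepScheduler()
num_gpus = 4
num_batches = 50

# Buggy behavior: the shared update counter advances once per GPU per
# batch, so after 50 batches on 4 GPUs the scheduler has seen 200 updates.
buggy_updates = num_batches * num_gpus

# Expected behavior: one update per batch regardless of GPU count.
expected_updates = num_batches

print(scheduler(buggy_updates))     # LR already decayed twice: 0.025
print(scheduler(expected_updates))  # LR still at the base value: 0.1
```

With the inflated count the learning rate decays four times too fast, which is why a schedule tuned for single-GPU training derails a multi-GPU ResNet run.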
