I'm using a distributed kvstore with the Gluon trainer and found the following two 
bugs:

1. Passing `update_on_kvstore=True` to `gluon.Trainer` has no effect: inspecting 
`trainer._update_on_kvstore` afterwards shows the value is still `False`. 

2. When a distributed kvstore is used, `gluon.Trainer` by default doesn't work 
with `mx.optimizer.LRScheduler` if a worker has more than one GPU. Specifically, 
the trainer calls the optimizer once per GPU, while the `LRScheduler` object is 
shared across GPUs, so it sees a wrong (inflated) update count. 
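The miscount in bug 2 can be sketched in plain Python, without mxnet. The `StepScheduler` class and the training loop below are hypothetical stand-ins for `mx.optimizer.LRScheduler` and the per-GPU optimizer calls; the point is that a counter shared across N GPUs advances N times per global step, so the schedule decays the learning rate far too early:

```python
class StepScheduler:
    """Hypothetical stand-in for an mx.optimizer.LRScheduler subclass:
    halves the learning rate every `step` updates."""
    def __init__(self, base_lr=0.1, step=100):
        self.base_lr = base_lr
        self.step = step

    def __call__(self, num_update):
        return self.base_lr * (0.5 ** (num_update // self.step))

scheduler = StepScheduler(base_lr=0.1, step=100)
num_gpus = 4
num_update = 0  # shared counter, bumped on every optimizer update

# 100 "global" training steps; when the trainer updates locally
# (update_on_kvstore is False), the optimizer runs once per GPU,
# so the shared scheduler counter advances num_gpus times per step.
for _ in range(100):
    for _ in range(num_gpus):
        num_update += 1
        lr = scheduler(num_update)

# After 100 global steps the scheduler has seen 400 updates, so the
# learning rate was halved 4 times instead of once.
print(num_update)  # 400, not 100
print(lr)          # ~0.00625; the intended value is 0.05
```

With a single GPU per worker the two counts coincide, which is why the bug only shows up with more than one GPU per worker.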

This means one cannot train ImageNet classification with a ResNet using the Gluon 
trainer. 

[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/12713 ]