[GitHub] [incubator-mxnet] threeleafzerg opened issue #12450: Loss normalizer needs to be all-reduced for softmax layer in distributed training if normalization type is set to valid_cnt

GitHub Mon, 03 Sep 2018 22:01:57 -0700

Description:
We are currently enabling multi-node training for MXNet Sockeye and found that 
when the normalization type is set to valid, the loss normalizer for softmax is 
not correct in distributed training (see softmax_output-inl.h).
The correct implementation should be:
If gradients are all-reduced in sum mode, valid_cnt should be all-reduced as 
well: grads = grads / valid_cnt.
If gradients are all-reduced in average mode, valid_cnt should be all-reduced 
too: grads = grads * node_num / valid_cnt.
The main reason is that in topologies such as SSD (CNN) or NMT (RNN), 
valid_cnt differs from node to node.
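
The effect can be illustrated with a small simulation. This is only a sketch 
under stated assumptions, not MXNet code: the two nodes, their gradient sums, 
their valid counts, and the helper names allreduce_sum / allreduce_avg are all 
hypothetical, and it assumes that "valid_cnt should be all-reduced" means 
summing valid_cnt across nodes.

```python
# Hypothetical per-node values (not taken from a real run):
# the unnormalized gradient sum and the number of valid labels on each node.
node_grad_sums = [6.0, 4.0]
node_valid_cnts = [3, 1]
node_num = len(node_grad_sums)


def allreduce_sum(values):
    """Stand-in for a sum all-reduce across nodes."""
    return sum(values)


def allreduce_avg(values):
    """Stand-in for an average all-reduce across nodes."""
    return sum(values) / len(values)


# Reference: what a single node holding all the data would compute.
reference = sum(node_grad_sums) / sum(node_valid_cnts)

# Behaviour described as incorrect: each node divides by its own valid_cnt,
# then the already-normalized gradients are averaged across nodes.
locally_normalized = [g / c for g, c in zip(node_grad_sums, node_valid_cnts)]
local_then_avg = allreduce_avg(locally_normalized)

# Proposed fix, sum mode: grads = grads / valid_cnt,
# with both grads and valid_cnt all-reduced.
global_valid_cnt = allreduce_sum(node_valid_cnts)
fixed_sum_mode = allreduce_sum(node_grad_sums) / global_valid_cnt

# Proposed fix, average mode: grads = grads * node_num / valid_cnt.
fixed_avg_mode = allreduce_avg(node_grad_sums) * node_num / global_valid_cnt

print(reference, local_then_avg, fixed_sum_mode, fixed_avg_mode)
```

With these made-up numbers, per-node normalization followed by averaging gives 
3.0, while the single-node reference and both proposed formulas give 2.5. The 
gap disappears only when every node has the same valid_cnt.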


[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/12450 ]
This message was relayed via gitbox.apache.org for [email protected]
