slitsey commented on issue #11466: [MXNET-560] Add temperature parameter in Softmax operator
URL: https://github.com/apache/incubator-mxnet/pull/11466#issuecomment-402794615

@apeforest @sxjscience I think backpropagation will be different for a SoftmaxOutput final layer, because the derivative of the softmax with temperature carries an extra factor of 1/temperature, which propagates into every d(error)/d(weight) term. (Someone is welcome to double-check my math.) So it is not the same as uniformly scaling the input data.

Another way to see this: the derivatives are still taken with respect to the unscaled input. In other words, backpropagation uses d f(x/T) / d x, not d f(x/T) / d (x/T), where f is the usual softmax (temperature of 1); by the chain rule, the former is the latter times 1/T. See also isarandi's comment on the first answer [here](https://math.stackexchange.com/questions/1579601/what-temperature-of-softmax-layer-should-i-use-during-neural-network-training).
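A quick numerical sanity check of the claim above (plain NumPy, not MXNet code; the function names are illustrative): it compares the Jacobian of softmax(x/T) with respect to x against the Jacobian of plain softmax evaluated at x/T, and verifies they differ by exactly a factor of 1/T.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax at temperature 1.
    e = np.exp(z - z.max())
    return e / e.sum()

def numerical_jacobian(f, x, eps=1e-5):
    # Central-difference Jacobian of vector-valued f at x.
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        d = np.zeros(n)
        d[j] = eps
        J[:, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

x = np.array([1.0, 2.0, 3.0])
T = 2.0

# d f(x/T) / d x -- what backpropagation actually uses.
J_temp = numerical_jacobian(lambda v: softmax(v / T), x)

# d f(u) / d u evaluated at u = x/T -- the "scaled data" Jacobian.
J_plain = numerical_jacobian(softmax, x / T)

# Chain rule: the two differ by a factor of 1/T.
assert np.allclose(J_temp, J_plain / T, atol=1e-6)
```

So the gradient flowing back through a temperature softmax is the plain-softmax gradient at the scaled logits, scaled by 1/T, which is why the backward pass cannot simply reuse the temperature-1 formula on pre-scaled inputs.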
