slitsey commented on issue #11466: [MXNET-560] Add temperature parameter in 
Softmax operator
URL: https://github.com/apache/incubator-mxnet/pull/11466#issuecomment-402794615
 
 
   @apeforest @sxjscience I think backpropagation will be different for a 
SoftmaxOutput final layer, because the derivative of the softmax with 
temperature picks up an extra factor of 1/temperature, which then propagates 
into every d(error)/d(weight) term. (Someone is welcome to double-check my 
math.) So it's not the same as uniformly scaling the data. Another way to 
think about this is that the derivatives are still taken with respect to the 
unscaled inputs: backpropagation uses d f(x/T)/dx, not d f(x/T)/d(x/T), where 
f is the usual softmax function (temperature 1). See also isarandi's comment 
on the first answer 
[here](https://math.stackexchange.com/questions/1579601/what-temperature-of-softmax-layer-should-i-use-during-neural-network-training).
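
A quick numerical check of the claim (a sketch in plain NumPy, not the actual MXNet implementation): for cross-entropy loss on `softmax(x/T)`, the gradient with respect to the unscaled logits `x` should be `(softmax(x/T) - onehot(y)) / T`, i.e. the usual softmax gradient with an extra 1/T factor. A central-difference estimate confirms it:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def loss(x, y, T):
    # cross-entropy of softmax(x/T) against integer label y
    return -np.log(softmax(x / T)[y])

def grad(x, y, T):
    # analytic gradient w.r.t. the UNSCALED logits x:
    # d/dx [-log softmax(x/T)_y] = (softmax(x/T) - onehot(y)) / T
    g = softmax(x / T)
    g[y] -= 1.0
    return g / T

rng = np.random.default_rng(0)
x, y, T = rng.normal(size=5), 2, 4.0

# central finite differences w.r.t. each component of x
eps = 1e-6
num = np.array([(loss(x + eps * np.eye(5)[i], y, T) -
                 loss(x - eps * np.eye(5)[i], y, T)) / (2 * eps)
                for i in range(5)])

assert np.allclose(num, grad(x, y, T), atol=1e-6)
# the 1/T factor explicitly: grad at temperature T equals the
# temperature-1 gradient evaluated at x/T, divided by T
assert np.allclose(grad(x.copy(), y, T), grad(x / T, y, 1.0) / T)
```

The second assertion is exactly the point above: differentiating with respect to x rather than x/T leaves the extra 1/T behind.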

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
