It turns out the primary cause of our performance problem was resolved after
making this post, but I was still stuck tracking down the cause of the weight
divergence. Digging into both implementations of Adam, it seemed that, at least
algebraically, both were computing the same thing (and, in the example above,
all hyper-parameters were set the same).
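
Since the two updates are algebraically identical, any difference has to come
from rounding. As a rough illustration (this is a NumPy sketch, not the actual
MXNet or TF optimizer code, and bias correction is left out), two groupings of
the same Adam update can already disagree in the last bits of float32:

    # Illustrative only: two algebraically equivalent groupings of the Adam
    # update, evaluated in float32; bias correction omitted for brevity.
    import numpy as np

    def adam_update_a(m, v, lr=1e-3, eps=1e-8):
        # lr * m / (sqrt(v) + eps)
        return lr * m / (np.sqrt(v) + eps)

    def adam_update_b(m, v, lr=1e-3, eps=1e-8):
        # (lr / (sqrt(v) + eps)) * m -- same value on paper, rounded differently
        return (lr / (np.sqrt(v) + eps)) * m

    rng = np.random.default_rng(0)
    m = rng.standard_normal(1000).astype(np.float32)
    v = np.abs(rng.standard_normal(1000)).astype(np.float32)
    diff = adam_update_a(m, v) - adam_update_b(m, v)
    print(np.max(np.abs(diff)))  # usually non-zero: a few ULPs per element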

My best guess for the weight divergence is simply the order of operations in
which things are calculated. Once the weights start to diverge (and they diverge
between MXNet and TF even after a single tanh) by a large enough amount, they
will continue to diverge, since the gradients computed by each framework will
then be different.
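
The mechanism is easy to reproduce in isolation. Here is a purely illustrative
NumPy sketch (not code from either framework) showing that float32 addition is
not associative, so two mathematically identical reductions done in a different
order can already disagree going into the first tanh:

    # Purely illustrative: float32 addition is not associative.
    import numpy as np

    a = np.float32(1e8)
    b = np.float32(-1e8)
    c = np.float32(1.0)

    left = (a + b) + c    # 1.0
    right = a + (b + c)   # 0.0, because b + c rounds back to -1e8 in float32
    print(left, right)

    # The discrepancy survives the first nonlinearity, so two runs that sum in
    # a different order already disagree after a single tanh; from then on each
    # run computes gradients from slightly different weights and the
    # trajectories never line up exactly again.
    print(np.tanh(left), np.tanh(right))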

This is not a very satisfying answer, but it seems to be the case.

[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/10563 ]