It turns out the primary cause of our performance problem was solved after making this post, but I was still stuck tracking down the cause of the weight divergence. Digging into both implementations of Adam, it seemed that, at least algebraically, both were computing the same thing (and, in the example above, all hyper-parameters were set the same).
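For reference, here is a minimal NumPy sketch of the update I believe both frameworks are computing algebraically (the function name, signature, and defaults are just illustrative, not taken from either code base):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moments (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```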
My best guess for the weight divergence is simply the order in which the floating-point operations are carried out. Once the weights diverge by a large enough amount (and they diverge between MXNet and TF even after a single tanh), they will continue to diverge, because the gradients computed by each framework will then also differ. This is not a very satisfying answer, but it seems to be the case.
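To make the order-of-operations point concrete, here is a small, hypothetical NumPy sketch (not code from either framework): two copies of the same weights are updated with formulas that are algebraically identical but associate the multiply and divide differently. In float32 the two orderings can round differently, and once the weights differ at all, the gradients on the next step differ too.

```python
import numpy as np

np.random.seed(0)
dtype = np.float32
x = np.random.randn(8).astype(dtype)
w1 = np.random.randn(8).astype(dtype)
w2 = w1.copy()

lr, beta2, eps = dtype(1e-2), dtype(0.999), dtype(1e-8)
v1 = np.zeros(8, dtype=dtype)
v2 = np.zeros(8, dtype=dtype)

def grad(w):
    # Gradient of 0.5 * tanh(w . x)^2 with respect to w.
    y = np.tanh(np.dot(w, x))
    return (y * (1.0 - y * y) * x).astype(dtype)

max_gap = 0.0
for step in range(2000):
    g1, g2 = grad(w1), grad(w2)
    v1 = beta2 * v1 + (1 - beta2) * g1 * g1
    v2 = beta2 * v2 + (1 - beta2) * g2 * g2
    # Algebraically identical updates, different association of * and /:
    w1 = w1 - (lr * g1) / (np.sqrt(v1) + eps)   # (lr * g) / denom
    w2 = w2 - lr * (g2 / (np.sqrt(v2) + eps))   # lr * (g / denom)
    max_gap = max(max_gap, float(np.max(np.abs(w1 - w2))))

print(max_gap)  # typically nonzero: the two orderings round differently
```

This is only meant to illustrate the mechanism; the actual divergence between MXNet and TF presumably comes from whichever operation orderings their kernels happen to use.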
