The PPO paper primarily relied on SGD, using Adam only as an alternative for better performance. Given the online nature of the problem, I would be surprised if the choice of SGD made a fundamental difference.
Also, while the KL term stabilizes the objective, PPO may be too conservative if there is no explicit exploration. Weight divergence is expected: any optimal policy must be deterministic, i.e. the action distribution saturates. There have been some reproducibility discussions around PPO and TRPO, so you may want to try a few more seeds on the original baseline as well. My 2 cents. [ Full content available at: https://github.com/apache/incubator-mxnet/issues/10563 ]
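To make the exploration point concrete, here is a minimal sketch (not from the thread or the paper) of the clipped surrogate objective with an entropy bonus; the entropy term is the standard way to push back against the policy saturating into a deterministic one. The function name and coefficient values are illustrative assumptions:

```python
import numpy as np

def ppo_clipped_loss(ratio, advantage, entropy, clip_eps=0.2, ent_coef=0.01):
    """Clipped surrogate objective plus entropy bonus (to be maximized).

    ratio:     per-sample probability ratio pi_new(a|s) / pi_old(a|s)
    advantage: per-sample advantage estimates
    entropy:   per-sample policy entropy; the bonus discourages
               premature saturation toward a deterministic policy
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Pessimistic (elementwise minimum) of the two surrogates, as in PPO.
    surrogate = np.minimum(unclipped, clipped)
    return np.mean(surrogate) + ent_coef * np.mean(entropy)
```

With `ent_coef=0`, nothing in the objective rewards keeping the policy stochastic, which is why the comment above suggests explicit exploration may be needed.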
