The PPO paper primarily relied on SGD and used Adam only as an alternative 
for better performance. Given the online nature of the problem, I would be 
surprised if SGD made a fundamental difference.

Also, while the KL term stabilizes the objective, PPO may be too conservative 
if there is no explicit exploration. Weight divergence is expected: any 
optimal policy must be deterministic, i.e. the weights saturate.
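One common way to add explicit exploration is an entropy bonus on top of the clipped surrogate. Here is a minimal sketch in plain Python; the function name, arguments, and coefficients are illustrative, not from the PPO codebase under discussion:

```python
import math

def ppo_loss_with_entropy(ratio, advantage, probs, clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO surrogate plus an entropy bonus (hypothetical helper).

    ratio:     pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage: estimated advantage for that action
    probs:     full action distribution under the new policy
    """
    # Standard clipped surrogate: take the pessimistic (min) objective.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    surrogate = min(unclipped, clipped)

    # Entropy of the new policy; rewarding it penalizes premature
    # saturation toward a deterministic (zero-entropy) policy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)

    # Return the loss to minimize: negative of the combined objective.
    return -(surrogate + ent_coef * entropy)
```

With a nonzero `ent_coef`, a near-deterministic distribution yields a strictly worse (higher) loss than a uniform one for the same surrogate term, which is the stabilizing effect the entropy bonus buys.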

There were some reproducibility discussions around PPO and TRPO. You may want 
to try a few more seeds on the original baseline as well.
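The multi-seed check can be as simple as looping the baseline run over a handful of seeds and reporting mean and spread. A sketch, where `run_trial` is a hypothetical stand-in for the actual training loop:

```python
import random
import statistics

def run_trial(seed, n_episodes=100):
    """Stand-in for one full training run (hypothetical).

    Replace the body with the real PPO/TRPO baseline, seeding every
    RNG involved (numpy, the environment, the framework) from `seed`.
    """
    rng = random.Random(seed)
    # Fake per-episode returns; the real run would report eval scores.
    return statistics.mean(rng.gauss(1.0, 0.5) for _ in range(n_episodes))

seeds = [0, 1, 2, 3, 4]
scores = [run_trial(s) for s in seeds]
print(f"mean={statistics.mean(scores):.3f} "
      f"stdev={statistics.stdev(scores):.3f} over {len(seeds)} seeds")
```

Reporting the across-seed standard deviation alongside the mean makes it clear whether an apparent gap between the two algorithms exceeds seed-to-seed noise.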

My 2 cents.

[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/10563 ]
This message was relayed via gitbox.apache.org for [email protected]
