The PPO paper primarily relied on SGD and used Adam only as an alternative 
for better performance. Given the online nature of the problem, I would be 
surprised if SGD makes a fundamental difference.
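If you want to verify that quickly, the optimizer swap is a one-line change in 
Gluon. A minimal sketch, assuming a stand-in `policy_net` and made-up learning 
rates:

```python
from mxnet import gluon

policy_net = gluon.nn.Dense(4)   # placeholder for the actual policy network
policy_net.initialize()

# Same parameters, two candidate optimizers; pick one per run and compare.
trainer_adam = gluon.Trainer(policy_net.collect_params(), 'adam',
                             {'learning_rate': 3e-4})
trainer_sgd = gluon.Trainer(policy_net.collect_params(), 'sgd',
                            {'learning_rate': 1e-2})
```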

Also, while the KL term stabilizes the objective and is good to have, PPO may 
become too conservative if there is no explicit exploration. Weight divergence 
is expected in the end: any optimal policy must be deterministic, i.e. the 
weights saturate (except in adversarial bandits). 
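If conservatism turns out to be the issue, the adaptive KL coefficient from 
the PPO paper plus an entropy bonus are the usual knobs. A rough NumPy sketch, 
with hypothetical constants (`kl_target` and `coef` are placeholders):

```python
import numpy as np

def adapt_kl_coef(beta, kl, kl_target=0.01):
    """PPO-style adaptive penalty: loosen/tighten beta around a KL target."""
    if kl < kl_target / 1.5:
        beta /= 2.0    # policy barely moved: penalize KL less
    elif kl > kl_target * 1.5:
        beta *= 2.0    # policy moved too far: penalize KL more
    return beta

def entropy_bonus(probs, coef=0.01):
    """Entropy of a categorical policy; adding coef * entropy to the
    objective slows the collapse toward a deterministic policy."""
    probs = np.clip(probs, 1e-8, 1.0)
    return coef * -np.sum(probs * np.log(probs), axis=-1)
```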

There have been some reproducibility discussions around PPO and TRPO, so you 
may want to try a few more seeds on the original baseline as well.
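Something like the following (with a hypothetical `train_and_evaluate`) makes 
the seed variance visible; the mean and standard deviation across seeds is 
more informative than a single run:

```python
import numpy as np

def run_seeds(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    returns = []
    for seed in seeds:
        np.random.seed(seed)   # in real code also call mx.random.seed(seed)
        returns.append(train_and_evaluate(seed))
    returns = np.asarray(returns)
    return returns.mean(), returns.std()
```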

My 2 cents. 
