The way I debugged the implementation is similar to the code I posted above. I ran the OpenAI baselines code alongside my implementation of PPO, made sure both were initialized identically, and stepped through them, comparing the weights and gradients at each update.
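A minimal sketch of that lockstep comparison, assuming each implementation's parameters have been dumped into a dict of NumPy arrays (the dict layout and names here are hypothetical, just for illustration):

```python
import numpy as np

def compare_params(ref_params, test_params, atol=1e-6):
    """Compare two dicts of parameter arrays (reference vs. your own
    implementation) and return a list of mismatches."""
    mismatches = []
    for name in ref_params:
        ref, test = ref_params[name], test_params[name]
        if ref.shape != test.shape:
            mismatches.append((name, "shape mismatch"))
        elif not np.allclose(ref, test, atol=atol):
            # Record the largest absolute difference for this tensor
            diff = float(np.max(np.abs(ref - test)))
            mismatches.append((name, diff))
    return mismatches

# Hypothetical usage: call after every optimizer step; the first
# tensor that diverges points at where the two implementations differ.
ref = {"policy/w": np.ones((2, 2)), "value/w": np.ones((2, 1))}
mine = {"policy/w": np.ones((2, 2)), "value/w": np.full((2, 1), 0.9)}
print(compare_params(ref, mine))
```

Running the same check on the gradients (not just the weights) usually catches a bug one step earlier, before it propagates into the parameters.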
I immediately found that my value function was incorrect. Also, double-check your initialization to begin with; PPO can be very sensitive to the weight initialization. Hope this helps!

[ Full content available at: https://github.com/apache/incubator-mxnet/issues/10563 ]
