Hi Chirag,

I think it would be better to also cc the mailing list.

I assume you are trying to implement A3C or something like it.
Actually, this has almost been done. See my PR from last summer:
https://github.com/mlpack/mlpack/pull/934
To compute the gradient, you can use src/mlpack/methods/ann/layer/policy.hpp
(https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35),
and there is also an actor_critic worker in that PR that shows how to use it.
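
To make the gradient computation concrete, here is a minimal standalone sketch
(my own illustration, not the code in that PR) of the math such a policy layer
has to implement: for a softmax policy, the gradient of the loss
-advantage * log(pi(action)) with respect to the pre-softmax logits is
advantage * (pi - onehot(action)), which is the error signal you hand back to
the network's backward pass.

#include <armadillo>

// Hypothetical illustration (not PR #934's code): error signal of the
// policy-gradient loss  L = -advantage * log(pi(action))  with respect to
// the pre-softmax logits of a softmax policy:
//   dL/dlogits = advantage * (pi - onehot(action)).
arma::vec PolicyGradientError(const arma::vec& logits,
                              const size_t action,
                              const double advantage)
{
  // Numerically stable softmax.
  arma::vec probs = arma::exp(logits - logits.max());
  probs /= arma::accu(probs);

  arma::vec error = advantage * probs;
  error(action) -= advantage;  // Subtract advantage * onehot(action).
  return error;
}

If you backpropagate this vector from the last linear layer of the actor, the
parameter gradients you get are the actor part of the A3C update.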

The most annoying thing is that it doesn't work and I don't know why. Marcus
and I tried hard but didn't find any obvious logic bug. So if you want to
implement A3C, I think the simplest way is to find that bug.
I have some hints for you:
1. Even if we don't have shared layers between the actor and the critic, A3C
should work well on small tasks like CartPole. If you do want shared layers,
you need to look into https://github.com/mlpack/mlpack/pull/1091 (I highly
recommend not doing this first, as it is not critical).
2. I believe the bug may lie in the async mechanism, so it's difficult to
debug (it's possible I'm wrong). A good approach, I think, is to implement A2C
and the corresponding PPO, which I believe is the state-of-the-art technique.
You can implement a vectorized environment, i.e. the interaction with the
environment is parallelized and synchronous, while the optimization happens in
a single thread (see the first sketch after this list). See OpenAI baselines
(TensorFlow, https://github.com/openai/baselines) or my A2C (PyTorch,
https://github.com/ShangtongZhang/DeepRL/blob/master/agent/A2C_agent.py) to
see how this idea works. I believe it's much easier to implement and debug.
Once you have the vectorized environment, it's easy to plug in all the
algorithms, e.g. one/n-step Q-learning, n-step Sarsa, actor-critic, and PPO.
In my experience, if tuned properly, the speed is comparable to fully async
implementations.
3. If you do want A3C and want to find that bug, I think you can implement
actor-critic with experience replay first to verify that it works in the
single-thread case (see the second sketch below). Note this is theoretically
wrong: to do it properly you would need off-policy actor-critic. In practice,
though, you can just ignore the importance sampling ratio and treat the data
in the buffer as on-policy; it should work and is enough to check the
implementation on small tasks like CartPole.
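
For hint 2, here is a rough sketch of what I mean by a vectorized environment.
The interface names (InitialSample, Sample, IsTerminal) are only my assumption
of what an environment class roughly looks like; adapt them to whatever the
real environment exposes.

#include <vector>

// Sketch of a "vectorized" environment: N copies step synchronously, so the
// learner consumes one batch of transitions per call from a single thread.
template<typename EnvironmentType>
class VecEnv
{
 public:
  explicit VecEnv(const size_t n) : envs(n), states(n)
  {
    for (size_t i = 0; i < n; ++i)
      states[i] = envs[i].InitialSample();
  }

  // Step every copy with its own action; copies that hit a terminal state are
  // reset in place.  The loop could also be an OpenMP parallel for.
  std::vector<double> Step(
      const std::vector<typename EnvironmentType::Action>& actions)
  {
    std::vector<double> rewards(envs.size());
    for (size_t i = 0; i < envs.size(); ++i)
    {
      typename EnvironmentType::State next;
      rewards[i] = envs[i].Sample(states[i], actions[i], next);
      states[i] = envs[i].IsTerminal(next) ? envs[i].InitialSample() : next;
    }
    return rewards;
  }

  const std::vector<typename EnvironmentType::State>& States() const
  { return states; }

 private:
  std::vector<EnvironmentType> envs;
  std::vector<typename EnvironmentType::State> states;
};

The point is that all copies advance in lockstep and the optimization sees one
batch per step, so there is no async machinery to hide bugs.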
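
And for hint 3, the single-thread sanity check boils down to computing the
usual TD target and advantage from replayed transitions while deliberately
dropping the importance-sampling correction. A tiny sketch of that piece
(helper names are mine):

// Per-transition pieces of the "actor-critic with experience replay" check:
// treat a replayed sample (s, a, r, s', terminal) as if it were on-policy,
// i.e. use no importance-sampling ratio at all.
double TdTarget(const double reward, const double nextValue,
                const bool terminal, const double discount = 0.99)
{
  return reward + (terminal ? 0.0 : discount * nextValue);  // Critic target.
}

double Advantage(const double reward, const double nextValue,
                 const double value, const bool terminal,
                 const double discount = 0.99)
{
  // Weight for the actor's log-probability gradient; intentionally no
  // importance-sampling correction.
  return TdTarget(reward, nextValue, terminal, discount) - value;
}

The actor error is then Advantage(...) times the (pi - onehot) term from the
earlier sketch, and the critic is regressed toward TdTarget(...).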

BTW, your understanding of how Forward and Backward work in DQN is absolutely
right.
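
To spell that out with a sketch (an illustrative helper of my own, not mlpack
code; the exact signatures are in q_learning_impl.hpp): Forward() gives
Q(s, .) for the sampled batch, you overwrite only the entries of the actions
that were actually taken, and Backward() against that patched matrix yields
the mean squared error gradient that updater.Update() then applies.

#include <armadillo>

// Illustrative helper: build the regression target described above.  qValues
// is the Forward() output, one column per sampled state.  Only the entries of
// the taken actions are overwritten, so the MSE gradient is zero for every
// other action.
arma::mat BuildDqnTarget(arma::mat qValues,              // Q(s, .) from Forward().
                         const arma::uvec& actions,      // Taken action per column.
                         const arma::vec& rewards,
                         const arma::vec& nextStateMaxQ, // max_a' Q_target(s', a').
                         const arma::uvec& terminal,     // 1 if s' is terminal.
                         const double discount)
{
  for (size_t i = 0; i < actions.n_elem; ++i)
  {
    qValues(actions(i), i) = rewards(i) +
        (terminal(i) ? 0.0 : discount * nextStateMaxQ(i));
  }
  return qValues;  // Pass this as the "label" matrix to Backward().
}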

Hope this can help,

Best regards,

Shangtong Zhang,
Second year graduate student,
Department of Computing Science,
University of Alberta
GitHub: https://github.com/ShangtongZhang | Stack Overflow:
http://stackoverflow.com/users/3650053/slardar-zhang
> On Feb 19, 2018, at 00:58, Chirag Ramdas <chiragram...@gmail.com> wrote:
> 
> I think I can probably write a custom compute_gradients() method for my
> backprop here, but I wanted to know if mlpack's implementation provides me
> with something similar to a convenient Forward() + Backward() pair which I
> can use for my requirements here.
> 
> 
> 
> Yours Sincerely,
> 
> Chirag Pabbaraju,
> B.E.(Hons.) Computer Science Engineering,
> BITS Pilani K.K. Birla Goa Campus,
> Off NH17B, Zuarinagar,
> Goa, India
> chiragram...@gmail.com | +91-9860632945
> 
> On Mon, Feb 19, 2018 at 1:26 PM, Chirag Ramdas <chiragram...@gmail.com>
> wrote:
> Hello,
> 
> I had an implementation question to ask. From the neural network
> implementation I saw (ffn_impl), e.g. lines 146-156
> <https://github.com/chogba/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp#L146-L156>,
> you first forwarded the network on the states and saw what output (Q value)
> it was giving for each action. Thereafter, you updated the targets for the
> actions which you actually saw from your experience replay mechanism, and
> this updated target matrix now behaves like the labels which you wanted the
> neural net to actually predict. Now, I saw from the q_learning_test.hpp file
> that you are initialising the FFN with MeanSquaredError, so I am assuming
> that if you pass this target matrix to learningNetwork.Backward(), it
> computes the gradients of the mean squared error with respect to all the
> parameters. Thereafter, with these gradients and the optimizer which you
> have specified (e.g. Adam), updater.Update() updates the parameters of the
> network.
> Do correct me if I was wrong anywhere.
> 
> So now my question is: I am faced with a custom optimisation function, and
> I am required to compute gradients of this function with respect to each of
> the parameters of my neural net. The Forward() + Backward() pair which was
> called in the above implementation required me to supply 1) what my network
> computes for an input and 2) what I believe it should have computed, and it
> thereafter computes the gradients by itself. But I simply have an objective
> function (no notion of what the network should have computed, i.e. labels)
> and correspondingly an update rule which I want to follow.
> 
> Precisely, I have a policy function pi which is approximated by a neural
> net parameterised by theta, and which outputs the probabilities of
> performing each action given a state. Now, I want the following update rule
> for the parameters:
> 
> [attached screenshot of the update rule, not reproduced here]
> 
> 
> Basically, I am asking if I can have my neural net optimise an objective
> function which I myself specify, in some form. I looked at the
> implementation of FFN, but I couldn't figure out how I could do this. I hope
> my question was clear.
> 
> Thanks a lot!
> 
> Yours Sincerely,
> 
> Chirag Pabbaraju,
> B.E.(Hons.) Computer Science Engineering,
> BITS Pilani K.K. Birla Goa Campus,
> Off NH17B, Zuarinagar,
> Goa, India
> chiragram...@gmail.com | +91-9860632945
> 
