For TRPO you need to read the original paper; I don't have a better suggestion.
Starting from a vanilla policy gradient is good. The main concern, from my 
experience, is that you need either experience replay or multiple workers to 
make a non-linear function approximator work (they give you decorrelated data, 
which is crucial for training a network). Without them it may be hard to tune, 
although it is possible with a small network on a small task, so it is worth a 
try.
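
For illustration, a minimal replay-buffer sketch (plain Python, not mlpack 
code) of why this helps: sampling a random minibatch from stored transitions 
breaks the strong temporal correlation of consecutive environment steps.

    import random
    from collections import deque

    class ReplayBuffer:
        # Fixed-size buffer; uniform sampling decorrelates consecutive transitions.
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)

        def store(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # A random minibatch drawn across the whole history is (nearly)
            # uncorrelated, unlike the stream of consecutive environment steps.
            return random.sample(list(self.buffer), batch_size)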

Shangtong Zhang,
Second year graduate student,
Department of Computing Science,
University of Alberta
Github <https://github.com/ShangtongZhang> | Stackoverflow 
<http://stackoverflow.com/users/3650053/slardar-zhang>
> On Feb 19, 2018, at 10:13, Chirag Ramdas <chiragram...@gmail.com> wrote:
> 
> Hi Shangtong,
> 
> Thank you so very much for the detailed reply, I appreciate it a lot!
> 
> I spoke to Marcus about an initial contribution to make my GSoC proposal 
> strong, and he suggested that I could implement vanilla stochastic policy 
> gradients. So I was looking to implement a vanilla version with a Monte Carlo 
> value estimate as my advantage function - basically just the simplest of 
> implementations.
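> 
> A rough sketch of this simplest-of-implementations idea (plain Python/NumPy, 
> with a linear softmax policy standing in for the network; the names here are 
> illustrative, not mlpack API):
> 
>     import numpy as np
> 
>     def softmax(z):
>         z = z - z.max()
>         e = np.exp(z)
>         return e / e.sum()
> 
>     def grad_log_pi(theta, s, a):
>         # For a linear softmax policy pi(a|s) = softmax(theta.T @ s),
>         # d log pi(a|s) / d theta = outer(s, one_hot(a) - pi(.|s)).
>         probs = softmax(theta.T @ s)
>         one_hot = np.zeros_like(probs)
>         one_hot[a] = 1.0
>         return np.outer(s, one_hot - probs)
> 
>     def reinforce_update(theta, episode, gamma=0.99, alpha=1e-3):
>         # episode is a list of (state, action, reward) from one full rollout.
>         G, returns = 0.0, []
>         for (_, _, r) in reversed(episode):
>             G = r + gamma * G
>             returns.append(G)
>         returns.reverse()
>         for (s, a, _), G_t in zip(episode, returns):
>             # The Monte Carlo return G_t plays the role of the advantage.
>             theta = theta + alpha * G_t * grad_log_pi(theta, s, a)
>         return theta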
> 
> I am yet to fully understand TRPO and PPO theoretically, because they are 
> statistically quite heavy. The papers provide mechanical pseudocode, but the 
> intuition about what is really happening is what I wish to understand. Towards 
> this, I am trying to find blogs, and indeed the past few days have gone by in 
> a beautiful RL blur! It really has been so interesting. If you can provide 
> some resources to understand the statistical intuition behind trust region 
> algorithms, it would really be helpful!
> 
> Right now, I am just looking at implementing a single-threaded vanilla policy 
> gradient algorithm. I will look at 
> https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35 
> and see how I can use it! I am not even looking at actor-critic right now, and 
> PPO is surely the state of the art, but that is well beyond scope for me at 
> the moment.
> 
> I am attaching a screenshot of what I am aiming to implement.
> What are your inputs on implementing this?
> Would you say that if I refer to the file you mentioned, it should be doable 
> in a single-threaded setting?
> 
> Thanks a lot again!
> 
> 
> On Feb 19, 2018 10:12 PM, "Shangtong Zhang" <zhangshangtong....@gmail.com> wrote:
> Hi Chirag,
> 
> I think it would be better to also cc the mail list.
> 
> I assume you are trying to implement A3C or something like it. Actually this 
> has almost been done; see my PR 
> https://github.com/mlpack/mlpack/pull/934
> This is my work from last summer. To compute the gradient, you can use 
> src/mlpack/methods/ann/layer/policy.hpp 
> <https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35>, 
> and there is also an actor_critic worker that shows how to use it.
> 
> The most annoying thing is that it doesn't work and I don't know why. Marcus 
> and I tried hard but did not find any obvious logic bug.
> So if you want to implement A3C, I think the simplest way is to find the bug.
> I have some hints for you:
> 1. Even if the actor and critic do not share layers, A3C should work well on 
> a small task like CartPole. If you do want shared layers, you need to look 
> into https://github.com/mlpack/mlpack/pull/1091 (I highly recommend not doing 
> this first, as it is not critical).
> 2. I believe the bug may lie in the async mechanism, so it is difficult to 
> debug (it is possible I am wrong). A good practice, I think, is to implement 
> A2C and the corresponding PPO, which I believe is the state-of-the-art 
> technique. You can implement a vectorized environment, i.e. the interaction 
> with the environment is parallelized and synchronous, while the optimization 
> happens in a single thread. See the OpenAI baselines (TensorFlow, 
> https://github.com/openai/baselines) or my A2C (PyTorch, 
> https://github.com/ShangtongZhang/DeepRL/blob/master/agent/A2C_agent.py) to 
> see how this idea works; there is also a small sketch after this list. I 
> believe it is much easier to implement and debug. Once you implement the 
> vectorized environment, it is easy to plug in all the algorithms, e.g. 
> one-step/n-step Q-learning, n-step Sarsa, actor-critic, and PPO. From my 
> experience, if tuned properly, the speed is comparable to fully asynchronous 
> implementations.
> 3. If you do want A3C and want to find that bug, I think you can implement 
> actor-critic with experience replay first to verify that it works in the 
> single-threaded case. (Note this is theoretically wrong: doing it properly 
> requires off-policy actor-critic. In practice, though, you can just ignore 
> the importance sampling ratio and treat the data in the buffer as on-policy; 
> it should work, and it is enough to check the implementation on a small task 
> like CartPole.)
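> 
> A minimal sketch of the vectorized-environment idea (plain Python; the 
> per-environment reset()/step() interface is assumed for illustration and is 
> not mlpack or baselines API): N environment copies are stepped in lockstep, 
> and a single optimizer thread consumes the stacked batch.
> 
>     import numpy as np
> 
>     class VectorizedEnv:
>         # Steps several environment copies synchronously, one batch per call.
>         def __init__(self, env_fns):
>             self.envs = [fn() for fn in env_fns]
> 
>         def reset(self):
>             return np.stack([env.reset() for env in self.envs])
> 
>         def step(self, actions):
>             results = [env.step(a) for env, a in zip(self.envs, actions)]
>             states, rewards, dones = map(np.array, zip(*results))
>             # Reset finished copies so the batch always stays full; the
>             # optimizer only ever sees stacked, synchronous data.
>             for i, env in enumerate(self.envs):
>                 if dones[i]:
>                     states[i] = env.reset()
>             return states, rewards, dones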
> 
> BTW, your understanding of how Forward and Backward work in DQN is absolutely 
> right.
> 
> Hope this can help,
> 
> Best regards,
> 
> Shangtong Zhang,
> Second year graduate student,
> Department of Computing Science,
> University of Alberta
> Github <https://github.com/ShangtongZhang> | Stackoverflow 
> <http://stackoverflow.com/users/3650053/slardar-zhang>
>> On Feb 19, 2018, at 00:58, Chirag Ramdas <chiragram...@gmail.com> wrote:
>> 
>> I think I can probably write a custom compute_gradients() method for my 
>> backprop here, but I wanted to know if mlpack's implementation provides 
>> something similar to a convenient Forward() + Backward() pair which I can 
>> use for my requirements here.
>> 
>> 
>> 
>> Yours Sincerely,
>> 
>> Chirag Pabbaraju,
>> B.E.(Hons.) Computer Science Engineering,
>> BITS Pilani K.K. Birla Goa Campus,
>> Off NH17B, Zuarinagar,
>> Goa, India
>> chiragram...@gmail.com | +91-9860632945
>> 
>> On Mon, Feb 19, 2018 at 1:26 PM, Chirag Ramdas <chiragram...@gmail.com> wrote:
>> Hello,
>> 
>> I had an implementation question to ask. From the neural network usage I saw 
>> (ffn_impl), e.g. lines 146-156 
>> <https://github.com/chogba/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp#L146-L156>, 
>> you first run Forward() on the states to see what output (Q value) the 
>> network gives for each action. Thereafter, you update the targets for the 
>> actions you actually observed through the experience replay mechanism, and 
>> this updated target matrix now behaves like the labels you want the neural 
>> net to predict. From the q_learning_test.hpp file I saw that the FFN is 
>> initialised with MeanSquaredError, so I am assuming that if you pass this 
>> target matrix to learningNetwork.Backward(), it computes the gradients of 
>> the mean squared error with respect to all the parameters. Thereafter, with 
>> these gradients and the optimiser you have specified (e.g. Adam), 
>> updater.Update() updates the parameters of the network.
>> Do correct me if I am wrong anywhere.
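>> 
>> (Roughly the computation I mean, as NumPy pseudocode; the names, shapes, and 
>> the forward-function argument are mine for illustration, not mlpack's API:)
>> 
>>     import numpy as np
>> 
>>     def dqn_targets(forward, batch, gamma=0.99):
>>         # forward(states) returns the network's Q-values, one row per sample.
>>         states, actions, rewards, next_states, dones = batch
>>         q_pred = forward(states)          # Forward() on the sampled states
>>         q_next = forward(next_states)     # Q-values used to bootstrap targets
>>         targets = q_pred.copy()
>>         # Only the taken action's entry changes; the others keep the
>>         # predicted value, so their squared-error gradient is zero.
>>         for i, a in enumerate(actions):
>>             targets[i, a] = rewards[i] + gamma * (1.0 - dones[i]) * q_next[i].max()
>>         return targets   # the "labels" handed to Backward() with MeanSquaredError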
>> 
>> So now my question is: I am faced with a custom optimisation function, and I 
>> need to compute the gradients of this function with respect to each of the 
>> parameters of my neural net. The Forward() + Backward() pair called in the 
>> above implementation required me to supply 1) what my network computes for 
>> an input and 2) what I believe it should have computed, and it then computes 
>> the gradients by itself. But I simply have an objective function (no notion 
>> of what the network should have computed, i.e. no labels) and, 
>> correspondingly, an update rule which I want to follow.
>> 
>> Precisely, I have a policy function pi, approximated by a neural net 
>> parameterised by theta, which outputs the probabilities of performing each 
>> action given a state. Now, I want the following update rule for the 
>> parameters:
>> 
>> <Screen Shot 2018-02-19 at 1.10.32 PM.png>
>> 
>> 
>> Basically, I am asking whether I can have my neural net optimise an 
>> objective function which I specify myself, in some form.
>> I looked at the implementation of FFN, but I could not figure out how I 
>> could do this. I hope my question was clear.
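>> 
>> (If the update rule in the screenshot is the usual REINFORCE ascent step, 
>> theta <- theta + alpha * G_t * grad log pi(a_t | s_t) - which is my 
>> assumption, not something stated here - then one common way to fit it into a 
>> loss-based framework is to minimise the surrogate loss 
>> -G_t * log pi(a_t | s_t): its gradient is exactly the negative of that 
>> update. A PyTorch-style sketch, not mlpack API:)
>> 
>>     import torch
>> 
>>     def policy_gradient_step(policy_net, optimizer, states, actions, returns):
>>         # Minimising -G_t * log pi(a_t|s_t) with a gradient-descent optimizer
>>         # performs the ascent update on theta described above.
>>         logits = policy_net(states)                    # (batch, num_actions)
>>         log_probs = torch.log_softmax(logits, dim=1)
>>         chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
>>         loss = -(returns * chosen).mean()
>>         optimizer.zero_grad()
>>         loss.backward()        # autograd supplies d loss / d theta
>>         optimizer.step()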
>> 
>> Thanks a lot!
>> 
>> Yours Sincerely,
>> 
>> Chirag Pabbaraju,
>> B.E.(Hons.) Computer Science Engineering,
>> BITS Pilani K.K. Birla Goa Campus,
>> Off NH17B, Zuarinagar,
>> Goa, India
>> chiragram...@gmail.com | +91-9860632945
>> 
> 
> <Screenshot_20180218-212917.jpg>

_______________________________________________
mlpack mailing list
mlpack@lists.mlpack.org
http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
