For TRPO you need to read the original paper; I don't have a better suggestion. Starting from a vanilla policy gradient is good. My main concern is that, in my experience, you need either experience replay or multiple workers to make a non-linear function approximator work (they give you uncorrelated data, which is crucial for training a network). Without them it may be hard to tune, although it is possible with a small network on a small task, so it's worth a try.
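
To make the point about uncorrelated data concrete, here is a minimal sketch of a uniform experience replay buffer in Python. It is not mlpack code: the `ReplayBuffer` class, its method names and the transition layout are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer that returns uniformly sampled transitions.

    Hypothetical helper for illustration; not part of mlpack.
    """

    def __init__(self, capacity, seed=None):
        # A bounded deque silently drops the oldest transition when full.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement mixes transitions from different
        # time steps, so a batch is no longer a run of consecutive,
        # strongly correlated steps.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)
```

A training loop would call `add` after every environment step and draw a batch with `sample` once the buffer holds at least `batch_size` transitions. Multiple workers reach a similar effect by collecting steps from several independent environment copies at once.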

Shangtong Zhang,
Second year graduate student,
Department of Computing Science,
University of Alberta
Github <https://github.com/ShangtongZhang> | Stackoverflow <http://stackoverflow.com/users/3650053/slardar-zhang>

> On Feb 19, 2018, at 10:13, Chirag Ramdas <chiragram...@gmail.com> wrote:
>
> Hi Shangtong,
>
> Thank you so very much for the detailed reply, I appreciate it a lot!
>
> I spoke to Marcus about an initial contribution to make my GSoC proposal strong, and he suggested that I could implement vanilla stochastic policy gradients. So I was looking to implement a vanilla version with a Monte Carlo value estimate as my advantage function, basically just the simplest of implementations.
>
> I am yet to fully understand TRPO and PPO theoretically, because they are statistically quite heavy. The papers provide mechanical pseudocode, but the intuition behind what is really happening is what I wish to understand. Towards this, I am trying to find blogs, and indeed the past few days have gone in a beautiful RL blur! But it really has been so interesting. If you can provide some resources for understanding the statistical intuition behind trust region algorithms, it would really be helpful!
>
> Right now, I am just looking at implementing a single-threaded vanilla policy gradient algorithm. I will look at https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35 and see how I can use it! I am not even looking at actor-critic right now, and PPO for sure is the state of the art, but that's way beyond scope for me right now.
>
> I am attaching a screenshot of what I am aiming at implementing.
> What are your inputs on implementing this?
> Would you say that if I refer to the file you have mentioned, it should be doable, considering a single-threaded environment?
>
> Thanks a lot again!
>
> On Feb 19, 2018 10:12 PM, "Shangtong Zhang" <zhangshangtong....@gmail.com> wrote:
>
> Hi Chirag,
>
> I think it would be better to also cc the mailing list.
>
> I assume you are trying to implement A3C or something like it. Actually this has almost been done. See my PR https://github.com/mlpack/mlpack/pull/934, which is my work from last summer. To compute the gradient, you can use src/mlpack/methods/ann/layer/policy.hpp <https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35>. There is also an actor_critic worker that shows how to use it.
>
> The most annoying thing is that it doesn't work and I don't know why. Marcus and I tried hard but didn't find any obvious logic bug. So if you want to implement A3C, I think the simplest way is to find the bug. I have some hints for you:
>
> 1. Even without shared layers between actor and critic, A3C should work well on a small task like CartPole. If you do want shared layers, you need to look into https://github.com/mlpack/mlpack/pull/1091 (I highly recommend not doing this first, as it is not critical).
> 2. I believe the bug may lie in the async mechanism, which makes it difficult to debug (it's possible I'm wrong). A good practice, I think, is to implement A2C and the corresponding PPO, which I believe is the state-of-the-art technique. You can implement a vectorized environment, i.e. the interaction with the environment is parallelized and synchronous, while the optimization happens in a single thread. See OpenAI baselines (tensorflow, https://github.com/openai/baselines) or my A2C (pytorch, https://github.com/ShangtongZhang/DeepRL/blob/master/agent/A2C_agent.py) to see how this idea works. I believe it's much easier to implement and debug. Once you implement the vectorized environment, it's easy to plug in all the algorithms, e.g. one/n-step Q-learning, n-step Sarsa, actor-critic and PPO. In my experience, if tuned properly, the speed is comparable to fully async implementations.
> 3. If you do want A3C and want to find that bug, I think you can first implement actor-critic with experience replay to verify that it works in the single-thread case. (Note this is theoretically wrong: to do it properly you need off-policy actor-critic. In practice you can just ignore the importance sampling ratio and treat the data in the buffer as on-policy; it should work and is enough to check the implementation on a small task like CartPole.)
>
> BTW your understanding of how forward and backward work in DQN is absolutely right.
>
> Hope this can help,
>
> Best regards,
>
> Shangtong Zhang
>
>> On Feb 19, 2018, at 00:58, Chirag Ramdas <chiragram...@gmail.com> wrote:
>>
>> I think I can probably write a custom compute_gradients() method for my backprop here, but I wanted to know if mlpack's implementation provides something similar to a convenient Forward() + Backward() pair which I can use for my requirements here.
>>
>> Yours Sincerely,
>>
>> Chirag Pabbaraju,
>> B.E.(Hons.) Computer Science Engineering,
>> BITS Pilani K.K. Birla Goa Campus,
>> Off NH17B, Zuarinagar,
>> Goa, India
>> chiragram...@gmail.com | +91-9860632945
>>
>> On Mon, Feb 19, 2018 at 1:26 PM, Chirag Ramdas <chiragram...@gmail.com> wrote:
>> Hello,
>>
>> I had an implementation question to ask. From the neural network implementation I saw (ffn_impl), e.g. lines 146-156 <https://github.com/chogba/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp#L146-L156>, you first forward the network on the states and see what output (Q value) it gives for each action. Thereafter, you update the targets for the actions which you actually saw from your experience replay mechanism, and this updated target matrix now behaves like the labels which you want the neural net to predict. Now, I saw from the q_learning_test.hpp file that you are initialising the FFN with MeanSquaredError, so I am assuming that if you pass this target matrix to learningNetwork.Backward(), it computes the gradients of the mean squared error with respect to all the parameters. Thereafter, with these gradients and the optimizer you have specified (e.g. Adam), updater.Update() updates the parameters of the network.
>> Do correct me if I was wrong anywhere.
>>
>> So now my question is this. I am faced with a custom optimisation function, and I am required to compute gradients of this function with respect to each of the parameters of my neural net. The Forward() + Backward() pair which was called in the above implementation requires me to supply 1) what my network computes for an input and 2) what I believe it should have computed, and thereafter computes the gradients by itself. But I simply have an objective function (no notion of what the network should have computed, i.e. labels) and correspondingly an update rule which I want to follow.
>>
>> Precisely, I have a policy function pi which is approximated by a neural net parameterised by theta, and which outputs the probabilities of performing each action given a state. Now, I want the following update rule for the parameters:
>>
>> <Screen Shot 2018-02-19 at 1.10.32 PM.png>
>>
>> Basically, I am asking if I can have my neural net optimise an objective function which I myself specify, in some form.
>> I looked at the implementation of ffn, but I couldn't figure out how I could do this. Hope my question was clear.
>>
>> Thanks a lot!
>>
>> Yours Sincerely,
>>
>> Chirag Pabbaraju
>
> <Screenshot_20180218-212917.jpg>
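
On the question in the quoted message about optimising a custom objective without labels, one common way to think about it is sketched below in Python. The attached screenshot is not reproduced in this thread, so the sketch assumes the usual REINFORCE-style rule, i.e. minimising the surrogate loss L = -A * log pi(a|s) for a sampled action a and an advantage estimate A. Under that assumption the error signal at the softmax input is A * (pi - onehot(a)), and it can be backpropagated in place of the (prediction - target) term of a squared-error loss. The linear policy and every function name here are hypothetical; none of this is mlpack's API.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, state):
    """Linear policy: logits[k] = sum_j weights[k][j] * state[j]."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(logits)


def surrogate_loss(weights, state, action, advantage):
    """L = -advantage * log pi(action | state)."""
    return -advantage * math.log(forward(weights, state)[action])


def backward(weights, state, action, advantage):
    """Analytic gradient of the surrogate loss with respect to the weights."""
    probs = forward(weights, state)
    # Error signal at the softmax input; no label is needed, only the
    # sampled action and its advantage.
    delta = [advantage * (p - (1.0 if k == action else 0.0))
             for k, p in enumerate(probs)]
    # Chain rule through the linear layer: dL/dW[k][j] = delta[k] * state[j].
    return [[d * s for s in state] for d in delta]


def numerical_gradient(weights, state, action, advantage, eps=1e-6):
    """Central finite differences, used only to check backward()."""
    grad = [[0.0] * len(row) for row in weights]
    for k in range(len(weights)):
        for j in range(len(weights[k])):
            plus = [row[:] for row in weights]
            minus = [row[:] for row in weights]
            plus[k][j] += eps
            minus[k][j] -= eps
            grad[k][j] = (surrogate_loss(plus, state, action, advantage)
                          - surrogate_loss(minus, state, action, advantage)) / (2 * eps)
    return grad
```

With a Monte Carlo return as the advantage, a gradient-descent step on this loss is the vanilla policy gradient update. In a deeper network the same `delta` would be the quantity handed to the backward pass of the last layer.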

_______________________________________________
mlpack mailing list
mlpack@lists.mlpack.org
http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack