Yes. First try the vanilla implementation; if it doesn’t work, augment it with
experience replay (ER).
However, I would suggest not merging your vanilla implementation with ER,
because it is theoretically wrong, as I mentioned before. I would also suggest
not merging your vanilla implementation without ER, as I’m pretty sure it
won’t work for large networks and large tasks.

Anyway, it’s a good starting point to prove you are good at this. And if you
want it to be merged, you can implement policy gradient + ER + importance
sampling ratio, which is theoretically right but may be unstable. You can
truncate the importance sampling ratio to make it stable (this introduces
bias, but the bias is acceptable).
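
A minimal sketch of the truncation idea, assuming the replay buffer stores the log-probability of each action under the policy that generated it. The function names and the truncation threshold `clip=1.0` are illustrative choices, not something fixed by this thread or by mlpack:

```python
import math

def truncated_is_ratio(logp_current, logp_behavior, clip=1.0):
    """Return min(clip, pi(a|s) / mu(a|s)), computed from log-probabilities.

    The minimum is taken in log space so that a very large ratio cannot
    overflow math.exp. `clip` must be positive (hypothetical default).
    """
    log_ratio = logp_current - logp_behavior
    return math.exp(min(log_ratio, math.log(clip)))

def weighted_pg_coefficients(batch, clip=1.0):
    """Per-sample weights that multiply grad log pi(a|s) in the update."""
    return [truncated_is_ratio(lc, lb, clip) * adv for lc, lb, adv in batch]

# Each tuple: (log pi(a|s) under the current policy,
#              log mu(a|s) recorded when the action was taken,
#              advantage or return estimate)
batch = [
    (math.log(0.9), math.log(0.1), 2.0),   # raw ratio about 9, truncated to 1
    (math.log(0.2), math.log(0.4), -1.0),  # raw ratio about 0.5, kept as is
]
coeffs = weighted_pg_coefficients(batch, clip=1.0)
```

Truncating caps the variance contributed by samples the current policy likes much more than the old policy did, at the cost of the bias mentioned above.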

Shangtong Zhang,
Second year graduate student,
Department of Computing Science,
University of Alberta
Github <> | Stackoverflow 
> On Feb 19, 2018, at 10:26, Chirag Ramdas <> wrote:
> I see. So what I can probably do is use an experience replay mechanism
> along with this vanilla implementation. This should intuitively work for a
> single-threaded worker, right? How does that sound for a start?
> On Feb 19, 2018 10:52 PM, "Shangtong Zhang" <> wrote:
> For TRPO you need to read the original paper; I don’t have a better idea.
> Starting from a vanilla policy gradient is good. However, my main concern,
> from experience, is that you need either experience replay or multiple
> workers to make a non-linear function approximator work (they give you
> uncorrelated data, which is crucial for training a network). Without them
> it may be hard to tune, although it is possible on a small network and a
> small task, so it’s worth a try.
> Shangtong Zhang,
>> On Feb 19, 2018, at 10:13, Chirag Ramdas <> wrote:
>> Hi Shangtong,
>> Thank you so very much for the detailed reply, I appreciate it a lot!
>> I spoke to Marcus about an initial contribution to make my GSoC proposal
>> strong, and he suggested that I could implement vanilla stochastic policy
>> gradients. So I was looking to implement a vanilla version with a Monte
>> Carlo value estimate as my advantage function, basically just the simplest
>> of implementations.
>> I am yet to fully understand the theory behind TRPO and PPO, because they
>> are statistically quite heavy. I mean, the papers provide mechanical
>> pseudocode, but the intuition about what is really happening is what I wish
>> to understand. Towards this, I am trying to find blogs, and indeed the past
>> few days have gone by in a beautiful RL blur! But it really has been so
>> interesting. If you can provide some resources to understand the statistical
>> intuition behind trust region algorithms, it would really be helpful!
>> Right now, I am just looking at implementing a single-threaded vanilla
>> policy gradient algorithm. I will look at
>> <>,
>> and see how I can use it! I am not even looking at actor-critic right now,
>> and PPO for sure is the state of the art, but that's way beyond scope for me
>> right now.
>> I am attaching a screenshot of what I am aiming to implement.
>> What are your inputs on implementing this?
>> Would you say that if I refer to the file you have mentioned, it should be
>> doable, considering a single-threaded environment?
>> Thanks a lot again!
>> On Feb 19, 2018 10:12 PM, "Shangtong Zhang" <> wrote:
>> Hi Chirag,
>> I think it would be better to also cc the mailing list.
>> I assume you are trying to implement A3C or something like this.
>> Actually this has almost been done. See my PR
>> <>
>> This is my work from last summer. To compute the gradient, you can use
>> src/mlpack/methods/ann/layer/policy.hpp
>> <>
>> There is also an actor_critic worker that shows how to use it.
>> The most annoying thing is that it doesn’t work and I don’t know why. Marcus
>> and I tried hard but didn’t find any obvious logic bug.
>> So if you want to implement A3C, I think the simplest way is to find the bug.
>> I have some hints for you:
>> 1. Even if we don’t have shared layers between the actor and the critic, A3C
>> should work well on a small task like CartPole. If you do want shared
>> layers, you need to look into
>> <> (I highly recommend not doing this first, as it is not critical)
>> 2. I believe the bug may lie in the async mechanism, which makes it difficult
>> to debug (it’s possible I’m wrong). A good practice, I think, is to implement
>> A2C and the corresponding PPO, which I believe is the state-of-the-art
>> technique. You can implement a vectorized environment, i.e. the interaction
>> with the environment is parallelized and synchronous, while the optimization
>> occurs in a single thread. See OpenAI baselines (tensorflow,
>> <>)
>> or my A2C (pytorch,
>> <>)
>> to see how this idea works. I believe it’s much easier to implement and
>> debug. Once you implement the vectorized environment, it’s easy to plug in
>> all the algorithms, e.g. one/n-step Q-learning, n-step SARSA, actor-critic
>> and PPO. From my experience, if tuned properly, the speed is comparable to
>> fully async implementations.
>> 3. If you do want A3C and want to find that bug, I think you can implement
>> actor-critic with experience replay first to verify that it works in the
>> single-thread case. (Note that this is theoretically wrong, because doing it
>> properly requires off-policy actor-critic. In practice you can just ignore
>> the importance sampling ratio and treat the data in the buffer as on-policy;
>> it should work and is enough to check the implementation on a small task
>> like CartPole.)
>> BTW, your understanding of how forward and backward work in DQN is
>> absolutely right.
>> Hope this can help,
>> Best regards,
>> Shangtong Zhang,
>>> On Feb 19, 2018, at 00:58, Chirag Ramdas <> wrote:
>>> I think I can probably write a custom compute_gradients() method for my
>>> backprop here, but I wanted to know if mlpack's implementation provides me
>>> with something similar to a convenient Forward() + Backward() pair which I
>>> can use for my requirements here.
>>> Yours Sincerely,
>>> Chirag Pabbaraju,
>>> B.E.(Hons.) Computer Science Engineering,
>>> BITS Pilani K.K. Birla Goa Campus,
>>> Off NH17B, Zuarinagar,
>>> Goa, India
>>> <> | +91-9860632945
>>> On Mon, Feb 19, 2018 at 1:26 PM, Chirag Ramdas <> wrote:
>>> Hello,
>>> I had an implementation question to ask. From the neural network
>>> implementation I saw (ffn_impl), e.g. lines 146-156
>>> <>,
>>> you first run the network forward on the states to see what output (Q
>>> value) it gives for each action. Thereafter, you update the targets
>>> for the actions which you actually saw from your experience replay
>>> mechanism, and this updated target matrix now behaves like the labels which
>>> you want the neural net to predict. Now, I saw from the
>>> q_learning_test.hpp file that you are initialising the FFN with
>>> MeanSquaredError, so I am assuming that if you pass this target matrix to
>>> learningNetwork.Backward(), it computes the gradients of the mean squared
>>> error with respect to all the parameters. Thereafter, with these gradients
>>> and the optimizer which you have specified (e.g. Adam), updater.Update()
>>> updates the parameters of the network.
>>> Do correct me if I was wrong anywhere.
>>> So now my question is this. I am faced with a custom optimisation function,
>>> and I am required to compute gradients of this function with respect to each
>>> of the parameters of my neural net. The Forward() + Backward() pair which
>>> was called in the above implementation required me to supply 1) what my
>>> network computes for an input and 2) what I believe it should have computed,
>>> and it then computes the gradients by itself. But I simply have an
>>> objective function (no notion of what the network should have computed, i.e.
>>> labels) and correspondingly an update rule which I want to follow.
>>> Precisely, I have a policy function pi which is approximated by a neural
>>> net parameterised by theta, and which outputs the probabilities of
>>> performing each action given a state. Now, I want the following update
>>> rule for the parameters:
>>> <Screen Shot 2018-02-19 at 1.10.32 PM.png>
>>> Basically, I am asking if I can have my neural net optimise an objective
>>> function which I myself specify, in some form.
>>> I looked at the implementation of ffn, but I couldn't figure out how I
>>> could do this. I hope my question was clear.
>>> Thanks a lot!
>>> Yours Sincerely,
>>> Chirag Pabbaraju,
>> <Screenshot_20180218-212917.jpg>
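
For the vanilla policy gradient with a Monte Carlo return discussed in this thread, the quantity to push backwards through a softmax policy network can be formed by hand, with no labels involved. The sketch below is only an illustration under stated assumptions: the policy head is a plain softmax over logits, the objective is -G_t * log pi(a_t|s_t), and all function names are hypothetical. It does not use or assume any mlpack API.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def logit_gradient(logits, action, ret):
    """Gradient of -ret * log pi(action|s) w.r.t. the logits.

    For a softmax policy this is ret * (pi - onehot(action)).
    """
    probs = softmax(logits)
    return [ret * (p - (1.0 if i == action else 0.0))
            for i, p in enumerate(probs)]

# One short episode with reward 1 at every step (illustrative numbers).
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
# Two actions with equal logits; action 0 was taken at the first step.
grad = logit_gradient([0.0, 0.0], action=0, ret=returns[0])
```

A vector like `grad` plays the role that the output-layer error plays for a supervised loss; how it would be fed into a particular library's backward pass is left open here.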

mlpack mailing list
