Hello Rajesh,

> The implementation of prioritized experience replay is the smallest among the
> three ideas proposed, as the idea is much simpler than the rest. So, ideally,
> the implementation of Double DQN and the duelling architecture should take
> somewhere between 2-3 months, considering all components such as testing. And
> if there is time left after that, the last extension can be added. Since it is
> a smaller addition, and I would be fully familiar with mlpack by then, I think
> the last part can be done quickly and could even be done post-summer, as I
> feel this component is quite useful to any RL library.

This sounds reasonable to me. I think every method you mentioned would fit into
the current codebase, so please feel free to choose the methods you find the
most interesting.
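
One small tip for the proposal: for prioritized experience replay it helps to
spell out the core sampling idea, i.e. drawing transitions with probability
proportional to their TD error instead of uniformly. Here is a rough sketch of
just that idea (plain C++ with placeholder names, not mlpack's replay
interface):

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Draw one transition index with probability proportional to
// p_i = (|tdError_i| + eps)^alpha; alpha = 0 recovers uniform sampling.
std::size_t SamplePrioritized(const std::vector<double>& tdErrors,
                              const double alpha,
                              const double eps,
                              std::mt19937& rng)
{
  std::vector<double> priorities(tdErrors.size());
  for (std::size_t i = 0; i < tdErrors.size(); ++i)
    priorities[i] = std::pow(std::abs(tdErrors[i]) + eps, alpha);

  // Sample an index with probability proportional to its priority.
  std::discrete_distribution<std::size_t> dist(priorities.begin(),
                                               priorities.end());
  return dist(rng);
}

In the paper this is paired with importance-sampling weights and a sum-tree so
that sampling stays O(log n); the linear scan above is only meant to illustrate
the idea.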

> While going through the code, though, I noticed something surprising: Shangtong
> Zhang has already implemented Double DQN. I saw it in this code:
>
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp
>
> Also, in one of the comments in the PR at
> https://github.com/mlpack/mlpack/pull/934, he mentions testing Double DQN
> (comment on 27th May). So I wanted to know if there is something more that
> needs to be done as part of Double DQN.

Ah right, we should close the PR to avoid any more confusion; it was just used
to track the overall progress.

> If Double DQN is already done, then I would propose that the duelling
> architecture and noisy nets be the main part of the project, with prioritized
> experience replay as the possible extension; otherwise, the original idea
> should be an achievable target.

Sounds good; note that it's also possible to improve or extend the existing
Double DQN method.
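
Just to make the difference concrete for the proposal: the only change Double
DQN makes to vanilla DQN is in how the target is formed, the online network
selects the greedy action and the target network evaluates it. A minimal sketch
of that rule (plain C++, not the actual mlpack code):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Double DQN target for a single transition (s, a, r, s').
// onlineQNext and targetQNext hold Q(s', .) from the online and target
// networks respectively.
double DoubleDQNTarget(const std::vector<double>& onlineQNext,
                       const std::vector<double>& targetQNext,
                       const double reward,
                       const double discount,
                       const bool terminal)
{
  if (terminal)
    return reward;

  // Action selection with the online network ...
  const std::size_t bestAction = std::distance(onlineQNext.begin(),
      std::max_element(onlineQNext.begin(), onlineQNext.end()));

  // ... evaluation with the target network. Vanilla DQN would instead use
  // *std::max_element(targetQNext.begin(), targetQNext.end()).
  return reward + discount * targetQNext[bestAction];
}

If you end up extending the existing implementation, comparing the computed
targets against this rule in the tests would be a reasonable sanity check.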

> As you suggested, I went through the code to figure out what can be extended,
> and I was very happy to find that the overall code is well structured and
> hence lends itself well to reuse, for example:

You are absolutely right, make sure to include that in your proposal.
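
Along the same lines, it might be worth showing in the proposal how small the
duelling-specific part actually is: only the head of the network changes,
splitting into a state-value stream and an advantage stream that are recombined
as Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a'). A rough sketch of just that
aggregation step (plain C++, independent of mlpack's ann API; the function name
is only a placeholder):

#include <cstddef>
#include <numeric>
#include <vector>

// Combine the two streams of a duelling head into Q-values:
// Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
std::vector<double> DuellingCombine(const double value,
                                    const std::vector<double>& advantages)
{
  const double meanAdvantage =
      std::accumulate(advantages.begin(), advantages.end(), 0.0) /
      advantages.size();

  std::vector<double> q(advantages.size());
  for (std::size_t i = 0; i < advantages.size(); ++i)
    q[i] = value + advantages[i] - meanAdvantage;

  return q;
}

Everything else (policy, environment, replay, the training loop) should carry
over as you describe.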

> The timeline is something I feel can be more flexible based on the progress.
> That is, if whatever has been proposed gets completed earlier than expected,
> then more features can be added (towards having all components of the Rainbow
> algorithm), or if it goes a little slower than expected, then I will ensure
> that I complete everything that was part of the proposal, even post-summer.

Sounds reasonable. We should see if we can define a minimal set of goals that
ideally should be finished by the end of the summer. Also, see
https://github.com/mlpack/mlpack/wiki/Google-Summer-of-Code-Application-Guide
for some tips.

I hope anything I said was helpful; let me know if I should clarify anything.

Thanks,
Marcus

> On 1. Mar 2018, at 13:16, ⁨яαנєѕн⁩ <⁨[email protected]⁩> wrote:
> 
> Hey Marcus,
>
>> I think each idea you mentioned would fit into the existing codebase, but
>> don't underestimate the time you need to implement the method, write good
>> tests, etc. Each part is important and takes time, so my recommendation is
>> to focus on two ideas and maybe propose to work on another one or extend an
>> idea if there is time left.
>
> I completely agree with this. It will be a lengthy project, so I will propose
> something on a smaller scale. I actually was asking more about the fitting
> into the codebase part, for which I got the answer. Thank you.
> 
> So, I was thinking the following can be done:
>
> 1. Implementation of Double DQN.
>
> 2. Implementation of the duelling architecture DQN or the Noisy Nets paper,
> whichever you think might be better.
>
> 3. Extension, if time permits: prioritized experience replay. The
> implementation of prioritized experience replay is the smallest among the
> three ideas proposed, as the idea is much simpler than the rest. So, ideally,
> the implementation of Double DQN and the duelling architecture should take
> somewhere between 2-3 months, considering all components such as testing. And
> if there is time left after that, the last extension can be added. Since it is
> a smaller addition, and I would be fully familiar with mlpack by then, I think
> the last part can be done quickly and could even be done post-summer, as I
> feel this component is quite useful to any RL library.
> 
> While going through the code, though, I noticed something surprising: Shangtong
> Zhang has already implemented Double DQN. I saw it in this code:
>
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp
>
> Also, in one of the comments in the PR at
> https://github.com/mlpack/mlpack/pull/934, he mentions testing Double DQN
> (comment on 27th May). So I wanted to know if there is something more that
> needs to be done as part of Double DQN.
> 
> If Double DQN is already done, then I would propose that the duelling
> architecture and noisy nets be the main part of the project, with prioritized
> experience replay as the possible extension; otherwise, the original idea
> should be an achievable target.
> 
> As you suggested, I went through the code to figure out what can be extended,
> and I was very happy to find that the overall code is well structured and
> hence lends itself well to reuse, for example:
>
> The policies are separate, so any change in the way the function approximator
> works will not affect the policy side. Hence,
> https://github.com/mlpack/mlpack/tree/master/src/mlpack/methods/reinforcement_learning/policy
> can be used as is and will be very useful for testing new methods.
>
> The same holds for the environments:
> https://github.com/mlpack/mlpack/tree/master/src/mlpack/methods/reinforcement_learning/environment
> can be used as is.
>
> The replay,
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/replay/random_replay.hpp,
> is the part that will be extended for prioritized experience replay, since
> that method modifies only this component; it remains the same in all the other
> implementations.
>
> We can reuse most of what is in
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp
> and
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning.hpp,
> but the network type will be different for both the duelling architecture and
> noisy nets; the other parts can be extended.
> 
> The timeline is something I feel can be more flexible based on the progress.
> That is, if whatever has been proposed gets completed earlier than expected,
> then more features can be added (towards having all components of the Rainbow
> algorithm), or if it goes a little slower than expected, then I will ensure
> that I complete everything that was part of the proposal, even post-summer.
> 
> So, I would like to know what more is required as part of the proposal and 
> also if Double DQN was fully implemented or not.
> 
> Regards, 
> Rajesh D M
> 
> 
> 
> On Tue, Feb 27, 2018 at 3:27 AM, Marcus Edel <[email protected]> wrote:
> Hello Rajesh,
> 
>> As you mentioned, I've been working on the new environment (Gridworld from
>> Sutton and Barto - it's a simple environment) for testing. I think it is
>> ready, but I want to test it in the standard way. So could you please tell me
>> how exactly the cartpole and mountain car environments were tested/run in
>> general, so that I can follow a similar procedure to check whether what I
>> have done is correct.
> 
> That sounds great,
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/tests/rl_components_test.cpp
> should be helpful.
> 
>> So, I think mlpack should have this latest state of the art available as part
>> of the library. It may not be possible to implement all of the above-mentioned
>> techniques in 3 months, but I feel they are not very hard to add either, as
>> they are mostly extensions on top of each other, and I would also be happy to
>> continue contributing after GSoC.
>>
>> So, can we work towards Rainbow as the goal for GSoC (with a few, but not all,
>> components)? Would that be a good idea?
> 
> Sounds like you already put some time into the project idea; that is great. I
> think each idea you mentioned would fit into the existing codebase, but don't
> underestimate the time you need to implement the method, write good tests,
> etc. Each part is important and takes time, so my recommendation is to focus
> on two ideas and maybe propose to work on another one or extend an idea if
> there is time left. Also, another tip for the proposal is to mention the parts
> that can be reused or have to be extended over the summer; a clear structure
> of the project idea helps a lot.
> 
> I hope anything I said was helpful; let me know if I should clarify anything.
> 
> Thanks,
> Marcus
> 
>> On 26. Feb 2018, at 19:23, ⁨яαנєѕн⁩ <⁨[email protected]⁩> wrote:
>> 
>> Hey Marcus, Rajesh here. 
>> 
>> As you mentioned, I've been working on the new environment (Gridworld from
>> Sutton and Barto - it's a simple environment) for testing. I think it is
>> ready, but I want to test it in the standard way. So could you please tell me
>> how exactly the cartpole and mountain car environments were tested/run in
>> general, so that I can follow a similar procedure to check whether what I
>> have done is correct.
>> 
>> Also, with this I have gotten a good idea of how mlpack works, and I am
>> getting more used to it by the day. I also wanted to start working on the
>> proposal in parallel.
>> 
>> I went through everything Shangtong Zhang had done last year as part of GSoC
>> and learnt that DQN and async n-step Q-learning are the major contributions,
>> with the rest of his work revolving around them.
>> 
>> So I think the following could be extensions to his work that would fit well
>> into the existing architecture he built:
>> 
>> 1. Double DQN (as suggested by you guys in the ideas list)
>> 
>> 2. Prioritized experience replay: in this method, the samples are no longer
>> drawn uniformly at random from the replay buffer, as they are in DQN, but are
>> prioritized based on a measure such as the TD error. This method's results
>> beat those of Double DQN.
>> 
>> 3. After this, DeepMind released their next improvement, the dueling
>> architecture: the state value and the action advantages are computed in
>> separate streams of the network and combined back before the last step. The
>> intuition behind this is that the value of a state does not always depend
>> only on the actions that can be taken from that state.
>> 
>> 4. They then came up with Noisy Nets: another improvement, usable alongside
>> all the above methods, which adds noise to the network weights and which,
>> according to them, improves the overall exploration efficiency.
>> 
>> They also had other improvements in Multi Step RL and Distributional RL.
>> 
>> After this is when they came up with their best algorithm:
>> 
>> Rainbow: a combination of all the above-mentioned algorithms. They were able
>> to combine them because they all work on different parts of the RL agent's
>> learning (exploration, policy updates, etc.). The results of Rainbow far
>> exceed those of any of the other techniques out there. The paper also shows
>> results for other combinations of the above-mentioned methods.
>> So, I think mlpack should have this latest state of the art available as part
>> of the library. It may not be possible to implement all of the above-mentioned
>> techniques in 3 months, but I feel they are not very hard to add either, as
>> they are mostly extensions on top of each other, and I would also be happy to
>> continue contributing after GSoC.
>>
>> So, can we work towards Rainbow as the goal for GSoC (with a few, but not all,
>> components)? Would that be a good idea?
>> 
>> I have already read all of these papers as part of my thesis work, and I am
>> actually working towards improving upon them, so I have a thorough
>> understanding of all the concepts and can start working on them right away.
>> 
>> PS: The other idea, Proximal Policy Optimization Algorithms (PPO), is
>> actually an improvement over Trust Region Policy Optimization (TRPO), so to
>> implement PPO, TRPO might have to be implemented first. Also, that is in the
>> domain of continuous action spaces and continuous state spaces (Rainbow and
>> the other techniques above can handle a continuous state space but not a
>> continuous action space), and the other state of the art in that area is
>> Deep Deterministic Policy Gradient (DDPG). So if you want that to be part of
>> mlpack, it would probably be a good idea to implement those three together.
>> I am equally interested in both sets of implementations (I have already gone
>> through all three of these papers as well).
>> 
>> I personally feel going with the first set is better, as Shangtong Zhang has
>> created a great base to build new methods on top of. Please let me know what
>> you think.
>> 
>> -- 
>> Regards,
>> Rajesh D M
>> <Distributional RL.pdf> <Dueling Network Architectures for DeepRL.pdf>
>> <Noisy Networks for exploration RL.pdf> <Prioritized_experience_replay.pdf>
>> <TrustRegionPolicyOptimisation.pdf>
> 
> 
> 
> 
> -- 
> Regards,
> Rajesh D M
