I do not have an explanation for why large values of lambda diverged on Plakoto. Selecting the right lambda value is usually domain- and/or architecture-specific. My experience with Portes/backgammon was that values of lambda > 0 sped up learning in the early stages, hence the starting value of 0.7. Back then (2011-2012), when a full training run needed at least a week on my computer, it was a good way to reduce the training time somewhat.
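For anyone curious about what the lambda parameter actually does, here is a minimal sketch of a TD(lambda) update with accumulating eligibility traces over a linear value function. It is an illustration only: the function and parameter names, the linear model, and the step size are assumptions made for this example, not code from Palamedes or gnubg.

    import numpy as np

    def td_lambda_episode(features, rewards, w, alpha=0.01, lam=0.7, gamma=1.0):
        """One episode of TD(lambda) with accumulating eligibility traces.

        features : list of feature vectors, one per visited position
        rewards  : rewards after each transition (for backgammon-like games,
                   typically 0 everywhere except the terminal transition,
                   which carries the game outcome)
        w        : weights of a linear value function V(s) = w . x(s)
        """
        e = np.zeros_like(w)                          # eligibility trace
        T = len(features) - 1
        for t in range(T):
            x = features[t]
            v = w @ x
            # Bootstrap from the successor value; the terminal position's worth
            # arrives through rewards[t], so bootstrap with 0 there.
            v_next = 0.0 if t == T - 1 else w @ features[t + 1]
            delta = rewards[t] + gamma * v_next - v   # TD error
            e = gamma * lam * e + x                   # decay trace, add current gradient
            w += alpha * delta * e                    # credit earlier states via the trace
        return w

With lam = 0 the trace is just the current feature vector, i.e. plain TD(0); larger lam spreads each TD error further back along the game, which is what speeds up early learning but can also amplify instability when the bootstrapped targets themselves are noisy.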
Nikos

On Sun, Dec 22, 2019 at 11:07 PM boomslang <[email protected]> wrote:

> Nikolaos Papahristou used TD(lambda) for his Palamedes bot, which plays
> several Greek backgammon variants. In "On the Design and Training of
> Bots to play Backgammon Variants", he writes:
>
> "In the Plakoto variant, values of λ>0.6 resulted in divergence, whereas
> lower values sometimes became unstable. So it was decided to keep λ=0 for
> this variant. For Portes and Fevga variants it was possible to increase
> the λ value without problems and this always resulted in faster learning,
> but unlike other reported results [16], final performance did not exceed
> experiments with λ=0."
>
> Portes is essentially the same as standard backgammon; the main
> differences are: (1) the absence of the doubling cube and (2) the absence
> of triple wins.
>
> He trained the Portes variant with lambda = 0.7 for the first 250k games,
> then proceeded with lambda = 0.
>
> Perhaps he can tell us why it works in one variant but not in others?
>
>
> On Sunday, 22 December 2019, 00:09:57 CET, Philippe Michel
> <[email protected]> wrote:
>
> On Sat, Dec 14, 2019 at 01:12:34PM +0100, Øystein Schønning-Johansen wrote:
>
> > The reinforcement learning that has been used until now is plain
> > temporal difference learning as described in Sutton and Barto (and
> > done by several science projects) with TD(lambda=0).
>
> I don't think this is the case (or the definition of TD is much wider
> than what I thought).
>
> The 1.0 version uses straightforward supervised training on a rolled-out
> database.
>
> I wasn't involved at the time, but as far as I know:
>
> Earlier versions, by Joseph Heled, used supervised training on a
> database evaluated at 2-ply.
>
> The very first versions by Gary Wong did indeed use TD training, but this
> was abandoned when it seemed stuck at an intermediate level of play
> (though the problem was probably not the training method, since TD-Gammon
> before that and BGBlitz since then did very well with TD).
>
> > Do you think the engine can be better at planning ahead if lambda is
> > increased? Has anyone done a lot of experiments with lambda other
> > than 0? (I don't think there is code in the repo to do anything other
> > than lambda=0, so maybe someone with another research code base can
> > answer?) Or someone with general knowledge of RL can answer?
>
> The engine doesn't "plan ahead", does it? It approximates the
> probabilities of the game outcomes from the current position (or we can
> say its equity, for simplicity).
>
> My understanding is that its potential accuracy depends on the neural
> network (architecture + input features), and that the training method
> (including the training database in the case of supervised learning)
> influences how close to this potential one can get, and how fast.
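To make the contrast with Philippe's description concrete, below is a matching sketch of supervised training on a rolled-out database: the targets are fixed rollout values rather than the network's own bootstrapped estimates, which removes the feedback loop that can make TD updates unstable. The dataset layout, names, and step size are illustrative assumptions, not gnubg's actual training code.

    import numpy as np

    def supervised_step(X, targets, w, alpha=0.01):
        """One pass of supervised training toward rolled-out targets.

        X       : matrix of feature vectors, one row per database position
        targets : rollout estimates for those positions (fixed values, unlike
                  the moving, bootstrapped targets used by TD)
        w       : weights of the same linear value function as in the
                  TD(lambda) sketch above
        """
        for x, y in zip(X, targets):
            error = y - w @ x          # regression toward the rollout value
            w += alpha * error * x     # plain stochastic gradient step
        return w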
