I do not have an explanation for why large values of lambda diverged on Plakoto. Selecting the right lambda value is usually domain- and/or architecture-specific. My experience with Portes/backgammon was that values of lambda > 0 sped up learning in the early stages, hence the starting value of 0.7. Back then (2011-2012), when a full training run needed at least a week on my computer, it was a good way to reduce the training time somewhat.
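For anyone curious about what the lambda parameter actually does, here is a minimal sketch of a TD(lambda) update with accumulating eligibility traces over a linear value function. It is an illustration only: the function and parameter names, the linear model, and the step size are assumptions made for this example, not code from Palamedes or gnubg.

    import numpy as np

    def td_lambda_episode(features, rewards, w, alpha=0.01, lam=0.7, gamma=1.0):
        """One episode of TD(lambda) with accumulating eligibility traces.

        features : list of feature vectors, one per visited position
        rewards  : rewards after each transition (for backgammon-like games,
                   typically 0 everywhere except the terminal transition,
                   which carries the game outcome)
        w        : weights of a linear value function V(s) = w . x(s)
        """
        e = np.zeros_like(w)                          # eligibility trace
        T = len(features) - 1
        for t in range(T):
            x = features[t]
            v = w @ x
            # Bootstrap from the successor value; the terminal position's worth
            # arrives through rewards[t], so bootstrap with 0 there.
            v_next = 0.0 if t == T - 1 else w @ features[t + 1]
            delta = rewards[t] + gamma * v_next - v   # TD error
            e = gamma * lam * e + x                   # decay trace, add current gradient
            w += alpha * delta * e                    # credit earlier states via the trace
        return w

With lam = 0 the trace is just the current feature vector, i.e. plain TD(0); larger lam spreads each TD error further back along the game, which is what speeds up early learning but can also amplify instability when the bootstrapped targets themselves are noisy.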
Nikos

On Sun, Dec 22, 2019 at 11:07 PM boomslang <[email protected]> wrote:

> Nikolaos Papahristou used TD(lambda) for his Palamedes bot, which plays
> several Greek backgammon variants. In "On the Design and Training of
> Bots to play Backgammon Variants", he writes:
>
> "In the Plakoto variant, values of λ>0.6 resulted in divergence, whereas
> lower values sometimes became unstable. So it was decided to keep λ=0 for
> this variant. For Portes and Fevga variants it was possible to increase
> the λ value without problems and this always resulted in faster learning,
> but unlike other reported results [16], final performance did not exceed
> experiments with λ=0."
>
> Portes is essentially the same as standard backgammon; the main
> differences are: (1) the absence of the doubling cube and (2) the absence
> of triple wins.
>
> He trained the Portes variant with lambda = 0.7 for the first 250k games,
> then proceeded with lambda = 0.
>
> Perhaps he can tell us why it works in one variant but not in others?
>
>
> On Sunday, 22 December 2019, 00:09:57 CET, Philippe Michel
> <[email protected]> wrote:
>
> On Sat, Dec 14, 2019 at 01:12:34PM +0100, Øystein Schønning-Johansen wrote:
>
> > The reinforcement learning that has been used until now is plain
> > temporal difference learning as described in Sutton and Barto (and
> > done by several science projects) with TD(lambda=0).
>
> I don't think this is the case (or the definition of TD is much wider
> than what I thought).
>
> The 1.0 version uses straightforward supervised training on a rolled-out
> database.
>
> I wasn't involved at the time, but as far as I know:
>
> Earlier versions, by Joseph Heled, used supervised training on a
> database evaluated at 2-ply.
>
> The very first versions by Gary Wong did indeed use TD training, but this
> was abandoned when it seemed stuck at an intermediate level of play
> (though the problem was probably not the training method, since TD-Gammon
> before that and BGBlitz since then did very well with TD).
>
> > Do you think the engine can be better at planning ahead if lambda is
> > increased? Has anyone done a lot of experiments with lambda other
> > than 0? (I don't think there is code in the repo to do anything other
> > than lambda=0, so maybe someone with another research code base can
> > answer?) Or someone with general knowledge of RL can answer?
>
> The engine doesn't "plan ahead", does it? It approximates the
> probabilities of the game outcomes from the current position (or we can
> say its equity, for simplicity).
>
> My understanding is that its potential accuracy depends on the neural
> network (architecture + input features), and that the training method
> (including the training database in the case of supervised learning)
> influences how close to this potential one can get, and how fast.
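To make the contrast with Philippe's description concrete, below is a matching sketch of supervised training on a rolled-out database: the targets are fixed rollout values rather than the network's own bootstrapped estimates, which removes the feedback loop that can make TD updates unstable. The dataset layout, names, and step size are illustrative assumptions, not gnubg's actual training code.

    import numpy as np

    def supervised_step(X, targets, w, alpha=0.01):
        """One pass of supervised training toward rolled-out targets.

        X       : matrix of feature vectors, one row per database position
        targets : rollout estimates for those positions (fixed values, unlike
                  the moving, bootstrapped targets used by TD)
        w       : weights of the same linear value function as in the
                  TD(lambda) sketch above
        """
        for x, y in zip(X, targets):
            error = y - w @ x          # regression toward the rollout value
            w += alpha * error * x     # plain stochastic gradient step
        return w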
