From: Brian Sheppard <[email protected]>
> Measuring small differences is a big problem for me. I would like to have
> better tools here.
> For instance, I am trying to measure whether a particular rule is an
> improvement, where with the rule it wins 60.5%, and without 60.0%. You need
> a staggering number of games to establish confidence. Yet this is the small,
> 5 to 10 Elo gain that Don referred to.
> I hoped to isolate cases where the *move* differs between versions, and then
> analyze (perhaps using a standard oracle like Fuego) whether those moves are
> plusses or minuses. But this is MCTS, and the program does not always play
> the same way even in the same position.
A very tough problem! How many is "a staggering number", just out of curiosity?
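For a rough sense of scale, a standard two-proportion power calculation puts a
number on it. Here is a sketch in plain Python; the 5% significance level and
80% power are my own assumptions, not anything from Brian's setup:

from statistics import NormalDist

def games_needed(p1, p2, alpha=0.05, power=0.80):
    """Games per version needed to distinguish win rates p1 and p2."""
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / 2)   # two-sided significance threshold
    z_b = z(power)           # desired statistical power
    p_bar = (p1 + p2) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return num / (p1 - p2) ** 2

print(round(games_needed(0.600, 0.605)))  # roughly 150,000

That comes out to about 150,000 games per version - around 300,000 in total -
so "staggering" seems about right.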
I believe at least one developer is using a network of idle workstations to run
tests. Is anybody using Amazon or some other cloud service? I recently read
that a firm rented 10,000 cores for 8 hours for $8,000 - a princely sum, but it
does scale down as well as up.
Sadly, Fuego (or any existing program) may not be a very good "oracle" to
determine whether move A or move B is best in a given situation.
Does anybody have experience with testing particular "hard cases", rather than
"1000 random games from scratch"?
That is, based on past experience, program X did move A in situation Y, which
turned out to be a disaster. Strong players suggest that B, C, or D would be
better.
There are more than a few such "what was the player thinking?" instances in the
archives.
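One way to mechanize that kind of test is over GTP, which Fuego and GNU Go
both speak: load the position just before the disaster and check whether the
engine now picks one of the suggested moves. A minimal sketch follows; the
engine command, SGF file, move number, and approved-move set are all
placeholders for illustration.

import subprocess

def gtp(proc, command):
    """Send one GTP command and return the engine's response text."""
    proc.stdin.write(command + "\n")
    proc.stdin.flush()
    lines = []
    while True:
        line = proc.stdout.readline()
        if line.strip() == "" and lines:   # blank line ends a GTP response
            break
        lines.append(line.strip())
    return lines[0].lstrip("= ").strip()

engine = subprocess.Popen(["fuego"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True)

gtp(engine, "loadsgf disaster_game.sgf 120")  # position just before move A
move = gtp(engine, "genmove b")
approved = {"C3", "D4", "F5"}                 # moves B, C, D from strong players
print("PASS" if move.upper() in approved else "FAIL: played " + move)
gtp(engine, "quit")

Since MCTS programs are nondeterministic, you would presumably want to run
genmove many times per position and track the fraction of acceptable answers,
rather than a single pass/fail.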
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go