I test against a reference (gnugo), and on cgos. Testing against the old version of the same bot can be misleading.
Against gnugo, I run it at the top level on 19x19, and use 2000 playouts rather than a time limit, for repeatability. The number of playouts is picked to give a win rate close to 50%, to get better statistics. When I started testing, with a much weaker bot, I tested on 9x9 with a 3-minute time limit.

To test the results for statistical significance I use an approximate formula for the two-standard-deviation bound: =1.96*SQRT(win-rate*(1-win-rate)/games-played). Typically I run 2000- to 4000-game matches, to get the bound below +/- 2%.

CGOS is good for finding bugs, since you play a variety of opponents, but it takes a long time to accumulate a significant number of games. Against gnugo I can make changes and run tests once or twice a day, but it takes a week or more to get good results on CGOS.

My experience now is that for every change that improves the program, I try at least 10 things that make it weaker.

David

From: [email protected] [mailto:[email protected]] On Behalf Of Steve Safarik
Sent: Saturday, February 19, 2011 11:14 AM
To: [email protected]
Subject: [Computer-go] Assessing Improvements

Suppose I develop what I think is an improved feature, for example a better influence function. I'd like to hear people's thoughts on how best and most quickly to determine whether it is in fact an improvement.

Do I just take my new function, replace the equivalent function in something like Fuego, and have the two engines start playing games? My impression is that this would be a rather slow way to get enough games to be significant. Is there a better way to compare two engines?

If that is indeed the method people generally use, how much time do you allow per move or game? And can you tell me your general experiences with doing this?

Thanks. Steve.
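[As a side note: the bound David quotes is the standard normal approximation for a binomial proportion. A minimal Python sketch, with the function name ci_half_width chosen here for illustration, shows why 2000-4000 games near a 50% win rate lands the bound around +/- 2%:]

```python
import math

def ci_half_width(win_rate: float, games: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for an observed win rate
    (normal approximation to the binomial proportion)."""
    return z * math.sqrt(win_rate * (1.0 - win_rate) / games)

# Near a 50% win rate, as in David's matches against gnugo:
print(ci_half_width(0.5, 2000))  # about 0.0219, i.e. +/- 2.2%
print(ci_half_width(0.5, 4000))  # about 0.0155, i.e. +/- 1.6%
```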
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
