Hi,

Following Arend's advice, gg378 and twin-378 played an 85-game endgame match:

- twin: 26 wins (1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 5 7 10 14 15 21 25 28)
- GNU Go: 14 wins (-9 -3 -3 -3 -2 -2 -2 -1 -1 -1 -1 -1 -1 -1)
- unchanged: 45

The sum is +135, an average of +1.6 points per game over the 85 games.
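(For clarity, a minimal sketch of how such a summary falls out of the per-game
score margins; the helper below is hypothetical, not part of the GNU Go tree:)

def summarize(margins):
    """Summarize a head-to-head match from per-game score margins:
    positive = twin wins by that many points, negative = GNU Go wins,
    zero = unchanged result. margins must be non-empty."""
    wins = [m for m in margins if m > 0]
    losses = [m for m in margins if m < 0]
    unchanged = sum(1 for m in margins if m == 0)
    total = sum(margins)
    return (len(wins), len(losses), unchanged, total, total / len(margins))

# e.g.: wins, losses, unchanged, total, avg = summarize(margins)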
That average looks fine, _but_ when one looks at the attached plot of cumulative +PASS -FAIL versus game_status, the twin fails a lot of endgame tests (game_status > 0.85). It is already a huge task to check the big failures, and I feel too lazy to investigate these 40 tests and the more than 50 endgame regressions (and I am a very bad yose player ;-).

By construction, the twin "knows" exactly how gg378 evaluates the game, so it may steal a big point before gg378 plays it, but that is still gnugo logic. So I wonder whether this endgame match is significant, or just a systematic error. In other words, a reliable endgame comparison would require another engine, one that is good at the endgame, and would compare the results of both engines against that reference engine.

Am I right, or just paranoid? Is there such an engine available?

- Alain

PS: the plot includes all board sizes; it is not so flat when they are separated, but I did too much clean-up and erased those results, so... I am re-running the regression tests :(
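(For reference, the attached curve is just a running sum over the regression
tests ordered by game_status: +1 per PASS, -1 per FAIL, so a sagging tail past
game_status 0.85 is the endgame weakness described above. A minimal sketch,
assuming per-test (game_status, passed) pairs as input; the function name is
made up, not an existing GNU Go tool:)

def cumulative_pass_fail(results):
    """results: iterable of (game_status, passed) pairs, one per test.
    Returns x/y data for the cumulative +PASS -FAIL curve."""
    xs, ys = [], []
    running = 0
    for status, passed in sorted(results):
        running += 1 if passed else -1
        xs.append(status)
        ys.append(running)
    return xs, ys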
[Attachment: twin4-d1.5_cumul+P-F_vs_gstatus.png - plot of cumulative +PASS -FAIL versus game_status]