----------------------------------------------
Don Said
----------------------------------------------
But first of all, thank you for your generous praise. I probably don't
deserve it (but I'll take what I can get :-)

+++++++++++++++++++++
Response :
+++++++++++++++++++++
It's hard to argue with that :)
Still, you have been a constant presence
on this list. As I recall, you may
well be the person with the
top posts-per-day score :)
That doesn't automatically mean that all
is good, but it's still something that gives
a feeling of security. If you are involved, we know
that the project can easily stand the test of time :)
(if it's worthy enough, that is). That may be why I so much
want you to feel good about it :) Still, I probably mean all
that I said.

----------------------------------------------
Don Said
----------------------------------------------
An external tester will test for conformance and it will compare 2 bots,
one of which we "trust" as being conforming.   But the tester will not
be deterministic, it will throw random positions at the bots so that a
black box author cannot present it with something that has hard coded
answers.   Also, I don't want it to be tuned for any given position.  

+++++++++++++++++++++
Response :
+++++++++++++++++++++
I definitely agree with all that. I guess that rather than just yes or no,
the test will output a probability of conformance :)

----------------------------------------------
Don Said
----------------------------------------------
I envision that you might be able to seed the tester with a random
number in order to duplicate the testing conditions and if this becomes
structured enough the "official" test would use a hidden standard seed.
This would be required so that different programs are not presented with
a different series of tests.  

+++++++++++++++++++++
Response :
+++++++++++++++++++++
I'm not sure I really get the point. Or are you thinking past the
"light playout" contest? What's the problem with programs being
presented with different series of tests?
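If I read it right, the idea is just that a fixed seed makes the pseudo-random position generator reproducible, so every engine sees the identical test series. A minimal sketch of that, with an entirely made-up generator (stones placed uniformly, no legality checks):

```python
import random

def random_positions(seed, n_positions, board_size=9, n_stones=20):
    # Same seed => same pseudo-random positions for every engine tested.
    # (Hypothetical generator: stones sampled uniformly, legality ignored.)
    rng = random.Random(seed)
    positions = []
    for _ in range(n_positions):
        points = rng.sample(range(board_size * board_size), n_stones)
        positions.append([(p, "BW"[i % 2]) for i, p in enumerate(points)])
    return positions

# A hidden "official" seed would fix the whole test series:
assert random_positions(42, 5) == random_positions(42, 5)
```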

----------------------------------------------
Don Said
----------------------------------------------
GTP is pretty much a necessity and is also very much a standard.  It's
necessary for external game playing and testing and for having an
external tester.  It doesn't make sense to produce a different system.
We could make it spit out numbers that you key in to a spreadsheet of
some kind or you could do the math manually, but this becomes unwieldy
if the test is to be very sophisticated.   
+++++++++++++++++++++
Response :
+++++++++++++++++++++
I agree that we need the program
to communicate with the outside world.
I agree that GTP is the standard, and I don't think
we should avoid it.
I disagree if this means we shouldn't try to design
a system that helps people take advantage
of this standard in a faster and easier way. That doesn't mean we
should actually implement anything, unless it's truly worth
it. Still, my opinion is that there is room for discussion there.

----------------------------------------------
Don Said
----------------------------------------------
In fact, this is the whole "raison d'être" (reason for being) of GTP,
for communicating with programs. 
+++++++++++++++++++++
Response :
+++++++++++++++++++++
I'm French, by the way :p
Well, I disagree. That's why GTP is
the standard: it allows for communicating
with programs. It is a protocol for
communicating with programs.
Still, it was engineered for the needs
of GNU Go (in particular regression testing, I think).

When I look at my code, I use a sort
of home-made communication standard.
I use it as a set of function calls,
and it's engineered to be a whole lot
easier than outputting directly in the GTP
format.

For example, when I output a move,
I just output a number, not a vertex per se.
I have the feeling that a lot of people would
use a representation like the one I use, where
you give a number to each intersection,
for example from 1 to 81. I use 0 as the
pass message, and -1 as the resign message.
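As a sketch of that translation (assuming a 9x9 board and my numbering; the function name is made up), going from the internal number to a GTP vertex is only a few lines:

```python
GTP_COLUMNS = "ABCDEFGHJ"  # GTP column letters skip 'I'

def index_to_gtp(move, board_size=9):
    """Translate the internal move encoding (1..81, 0 = pass,
    -1 = resign) into a GTP vertex string."""
    if move == 0:
        return "pass"
    if move == -1:
        return "resign"
    row, col = divmod(move - 1, board_size)
    return f"{GTP_COLUMNS[col]}{row + 1}"

print(index_to_gtp(1))   # -> A1
print(index_to_gtp(81))  # -> J9
```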

It allows me to concentrate more on what is
meaningful; then I have a layer that translates
all that into actual GTP commands. Now it may
be (or not) that this 1..81 representation does indeed reduce
the thinking and testing time for interacting with GTP.
It does for me, so I'd be happy to know whether it would for others
too. I know that it was not very enjoyable for me to spend
so much time on the GTP part. What I would like is to come
up with a "better" way of representing things, one that would be easier
and more natural to implement. I have done this very informally
in my own systems, but it still needs a lot more tuning and criticism.

So the plan was to do exactly what I do: offer the contenders
a way of handling messages and responses that is easier to implement, then
add a little external module to translate those into well-formed GTP commands.

Suppose you have a GTP server, let's say GoGui. You would pass it
the translator as the Go program. The translator would take
the actual engine as an argument and execute it as-is.
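A minimal sketch of such a translator, assuming a 9x9 board, my numeric move convention, and a made-up simplified engine protocol (only `genmove` is really translated; every other command is blindly acknowledged):

```python
import subprocess
import sys

GTP_COLUMNS = "ABCDEFGHJ"  # GTP column letters skip 'I'

def number_to_vertex(n, size=9):
    # Engine speaks numbers: 1..81, 0 = pass, -1 = resign.
    if n == 0:
        return "pass"
    if n == -1:
        return "resign"
    row, col = divmod(n - 1, size)
    return f"{GTP_COLUMNS[col]}{row + 1}"

def main(engine_cmd):
    """GoGui runs this script as its 'Go program'; the real engine
    (which only speaks numbers) is launched as a subprocess."""
    engine = subprocess.Popen(engine_cmd, stdin=subprocess.PIPE,
                              stdout=subprocess.PIPE, text=True)
    for line in sys.stdin:
        cmd = line.split()
        if not cmd:
            continue
        if cmd[0] == "genmove":
            engine.stdin.write(f"genmove {cmd[1]}\n")
            engine.stdin.flush()
            n = int(engine.stdout.readline())
            print(f"= {number_to_vertex(n)}\n", flush=True)
        elif cmd[0] == "quit":
            print("=\n", flush=True)
            break
        else:
            print("=\n", flush=True)  # acknowledge everything else

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```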

Then there is the speed problem. We have great tools for testing programs
against each other, in particular the GoGui test suite (I don't know what
CGOS uses). Still, even with near-instantly generated moves (10,000 per
second ..) it takes a few dozen seconds to get a game done, with the
results. I still use it because it's so handy, but for tests with fast
move generators it's really too slow.
(Now I wonder whether my GTP tunnel is why it takes so much time,
or whether this is due to the server.) There clearly is room for
something faster there.
I'm not saying it's really worth re-engineering all the tools, only
that it may be worth discussing whether there is an easy way to make
this faster :)


===
So to sum up, I have two issues:
- it takes too much time to get a GTP engine right. (The main point)
- it's slow to use twogtp with two fast move generators for statistical
regression testing. (A side discussion)
===
Therefore I wonder what solution we could come up with, even if it's clear
that it isn't worth anyone implementing it :)
===


----------------------------------------------
Don Said
----------------------------------------------
We can always publish numbers that people can use for informal checking.
But I definitely want some kind of conformance metric that is not just
ad-hoc.  
+++++++++++++++++++++
Response :
+++++++++++++++++++++
I agree with your statistical metric of
randomly generated positions
and comparison of the scoring done on them.
In fact, it would probably be sufficient as a black-box test.

Program-vs-program play would still add more confidence :)


----------------------------------------------
Don Said
----------------------------------------------
I really like your idea of massive automated testing to test
conformance,  but you know this is extremely CPU intensive. 
+++++++++++++++++++++
Response :
+++++++++++++++++++++
In fact .. no, I don't know :)
I made up some numbers,
and it seems that it would indeed take
a few hours at best (for 2,000 games).
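For the record, the back-of-the-envelope version of those made-up numbers (the ~10 seconds per game is a pure guess, loosely based on the "few dozen seconds" figure above):

```python
games = 2000
seconds_per_game = 10  # guess: tool/GTP overhead dominates fast engines
total_hours = games * seconds_per_game / 3600
print(f"about {total_hours:.1f} hours")  # -> about 5.6 hours
```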


----------------------------------------------
Don Said
----------------------------------------------
 It would
take tens or hundreds of thousands of games to be able to say with high
confidence that 2 programs are functionally identical in strength.   So
I envision a primary test that runs relatively quickly and a more
comprehensive test based on game play for the most interesting programs
or for anyone willing to take it that far.
+++++++++++++++++++++
Response :
+++++++++++++++++++++
I think 2,000 games would be more than enough to demonstrate
a near-50% win ratio.
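The statistics behind that claim, for the record (plain normal approximation of a binomial win ratio, nothing Go-specific):

```python
import math

n = 2000                         # games played
p = 0.5                          # worst case: variance is maximal at 50%
se = math.sqrt(p * (1 - p) / n)  # standard error of the observed win ratio
half_width = 1.96 * se           # 95% confidence half-width
print(f"50% +/- {100 * half_width:.1f} points")  # -> 50% +/- 2.2 points
```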

The position-scoring test can be fast. I suspect that it is also nearly
enough to do only that.
We can design it so it takes about 10 minutes per test suite.
I think it is enough to generate a few positions (how many exactly?), then
ask both the reference bot and the bot under test to score each legal
move of them - with a number of simulations high enough to get some
reproducibility -
then we can get a confidence bound on how much alike they are.
As an added bonus, if the number of simulations is high enough,
the server can also time the speed, without the network and communication
latency having too much impact (which may be of limited value, as the
hardware wouldn't be comparable, but well).
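One way the "how alike" comparison could look, with made-up numbers (the scores are the per-move win-rate estimates each bot returns for one position; the noise bound is the rough sampling error expected from `n_sims` playouts per move):

```python
import math

def agreement(ref_scores, test_scores, n_sims):
    # Mean absolute difference between the two bots' per-move scores,
    # compared with the noise expected from n_sims simulations per move.
    diffs = [abs(a - b) for a, b in zip(ref_scores, test_scores)]
    mean_diff = sum(diffs) / len(diffs)
    noise = 2 * math.sqrt(0.25 / n_sims)  # rough bound on chance disagreement
    return mean_diff, mean_diff <= noise

ref  = [0.52, 0.48, 0.61, 0.40]   # reference bot, made-up values
test = [0.53, 0.47, 0.60, 0.41]   # bot under test, made-up values
mean_diff, alike = agreement(ref, test, n_sims=1000)
print(round(mean_diff, 3), alike)  # -> 0.01 True
```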

_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/
