On Wed, 2008-10-15 at 10:13 +0200, Denis fidaali wrote:
> ----------------------------------------------
> Don Said
> ----------------------------------------------
> An external tester will test for conformance, and it will compare two
> bots, one of which we "trust" as being conforming. But the tester will
> not be deterministic; it will throw random positions at the bots so
> that a black-box author cannot present it with something that has
> hard-coded answers. Also, I don't want it to be tuned for any given
> position.
>
> +++++++++++++++++++++
> Response:
> +++++++++++++++++++++
> I definitely agree with all that. I guess rather than just yes or no,
> the test will output a probability of conformance :)
Something like that. I'm not exactly sure how to compute the probability
of conformance, so I may ask for help from the more mathematically
oriented people on this list. I can calculate the standard deviation of
the score to start with, but really I would like to convert all the
relevant statistics into something that looks like a probability of
conformance.

It did occur to me that I could do this Monte Carlo style, but that's
compute intensive and not the best way to do it. Imagine running the
test thousands of times with a trusted reference program and charting
the various possible outcomes. It would be easy to find which scores
(high and low) fall outside the 99th percentile, for instance. But this
is really not the right way to do it, especially if the test is not
deterministic.

> ----------------------------------------------
> Don Said
> ----------------------------------------------
> I envision that you might be able to seed the tester with a random
> number in order to duplicate the testing conditions, and if this
> becomes structured enough, the "official" test would use a hidden
> standard seed. This would be required so that different programs are
> not presented with different series of tests.
>
> +++++++++++++++++++++
> Response:
> +++++++++++++++++++++
> I'm not sure I really get the point. Or are you thinking past the
> "light playout" contest? What's the problem with programs being
> presented different series of tests?

You would be free to present the program with different series of tests,
and that would work just fine until you got to the performance
measurement test, which should be consistent. I'm just saying I don't
want to run one set of positions for one program and a totally different
set of positions for another, and use them directly to compare
performance. The performance test itself of course does not have to be a
secret, but I rather liked the idea that you can't micro-tweak for a
specific test.
But after some reflection, it should probably be public, for
verifiability.

> ----------------------------------------------
> Don Said
> ----------------------------------------------
> GTP is pretty much a necessity and is also very much a standard. It's
> necessary for external game playing and testing and for having an
> external tester. It doesn't make sense to produce a different system.
> We could make it spit out numbers that you key into a spreadsheet of
> some kind, or you could do the math manually, but this becomes
> unwieldy if the test is to be very sophisticated.
>
> +++++++++++++++++++++
> Response:
> +++++++++++++++++++++
> I agree that we need the program to communicate with the outside
> world. I agree that GTP is standard. I don't really think we should
> avoid it. I disagree if this means we shouldn't try to think out a
> system that would help people take advantage of this standard in a
> faster and easier way. That does not mean we should actually
> implement anything, except if it's truly worth it. Still, my opinion
> is that there is room for discussion there.

Yes, it's possible to build another protocol that is fine-tuned for this
kind of test. But I still want the ability to auto-test various
implementations of this, and GTP is a no-brainer here. How about this:
we can define a simple way to run a test manually based on the opening
position (we cannot use a variety of positions without a standardized
interface, and I don't want to work out a brand new protocol just for
this).

> ----------------------------------------------
> Don Said
> ----------------------------------------------
> In fact, this is the whole "raison d'être" (reason for being) of GTP:
> communicating with programs.
>
> +++++++++++++++++++++
> Response:
> +++++++++++++++++++++
> I'm French, by the way :p
> Well, I disagree. That's why GTP is standard: it allows for
> communicating with programs. It is a protocol for communication with
> programs...
> Still, it was engineered for the needs of GNU Go (in particular
> regression testing, I think).
>
> When I look at my code, I use a sort of home-made communication
> standard. I use it as a set of function calls, and it's engineered to
> be a whole lot easier than outputting directly in the GTP format.
>
> For example, when I output a move, I just output a number, not a
> vertex per se. I have the feeling that a lot of people would use a
> representation like the one I use, where you give a number to each
> intersection, for example from 1 to 81. I use 0 as the pass message
> and -1 as the resign message.
>
> It allows me to concentrate more on what is meaningful. Then I have a
> layer that translates all that into effective GTP commands. Now it
> may be (or not) that this 1..81 representation does indeed reduce the
> thinking and testing time for interacting with GTP. It does for me,
> so I'd be happy to know if it would for others too. I know that it
> was not very enjoyable for me to spend so much time on the GTP part.
> What I would like is to come up with a "better" way of representing
> things that would be easier and more natural to implement. I have
> done that very informally on my own systems; still, it needs a lot
> more tuning and criticizing.
>
> So the plan was to do exactly what I do: propose for the contenders a
> way of handling messages and responses that is easier to implement,
> then add a little external module to translate those into well-formed
> GTP commands.
>
> Suppose you have a GTP server, let's say GoGui. Then you would pass
> it the translator as the Go program. The translator would take the
> effective program as an argument and execute it as is.
>
> Then there is the speed problem. We have great tools for testing
> programs against each other, in particular the GoGui test suite (I
> don't know what CGOS uses). Still, even for instantly generated moves
> (10,000 per second...)
> it takes a few dozen seconds to get a game done, with the results. I
> still use it because it's so handy, but for fast move-generation
> tests it's really too slow. (Now I wonder if my GTP tunnel is the
> reason it takes so much time, or if this is due to the server.)
> Still, there clearly is room for something faster there. I'm not
> saying it's really worth re-engineering all the tools, only that it
> may be worth discussing whether there is an easy way to make this
> faster :)
>
> ===
> So to sum up, I have two issues:
> - It takes too much time to get a GTP engine right. (The point)
> - It's slow to use twogtp with two fast move generators for
>   statistical regression testing. (Alternate discussion effort)
> ===
> Therefore I wonder what solution we could come up with, even if it's
> clear that it's not worth it for someone to implement it :)
> ===

I'll defer to the rest of the program authors on this one. Let's see if
we get any comments on whether people are willing to add another,
simpler protocol as an extra layer between GTP and their engines. If you
can lay it out for us and post it to the list, that would be helpful.

> ----------------------------------------------
> Don Said
> ----------------------------------------------
> We can always publish numbers that people can use for informal
> checking. But I definitely want some kind of conformance metric that
> is not just ad hoc.
>
> +++++++++++++++++++++
> Response:
> +++++++++++++++++++++
> I agree with your statistical metric of randomly generated positions
> and comparison of the scoring done there. In fact it would probably
> be sufficient as a black-box test, with program-vs-program play
> adding more confidence :)
>
> ----------------------------------------------
> Don Said
> ----------------------------------------------
> I really like your idea of massive automated testing to test
> conformance, but you know this is extremely CPU intensive.
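The numeric move encoding Denis describes (1..81 for board points, 0 for
pass, -1 for resign) could be bridged to GTP along these lines. A
minimal sketch, assuming points are numbered row by row starting from
A1; note that GTP column letters skip 'I'. The exact numbering order is
an assumption, since the original message doesn't specify it.

```python
GTP_COLS = "ABCDEFGHJ"  # GTP column letters, skipping 'I'
BOARD_SIZE = 9

def index_to_gtp(i):
    """Translate a numeric move (1..81, 0 = pass, -1 = resign) into a
    GTP vertex string such as "A1" or "J9"."""
    if i == 0:
        return "pass"
    if i == -1:
        return "resign"
    if not 1 <= i <= BOARD_SIZE * BOARD_SIZE:
        raise ValueError("move index out of range: %d" % i)
    row, col = divmod(i - 1, BOARD_SIZE)
    return "%s%d" % (GTP_COLS[col], row + 1)

def gtp_to_index(vertex):
    """Inverse translation, for feeding GTP vertices back to an engine
    that speaks the numeric protocol."""
    v = vertex.strip().upper()
    if v == "PASS":
        return 0
    if v == "RESIGN":
        return -1
    col = GTP_COLS.index(v[0])
    row = int(v[1:]) - 1
    return row * BOARD_SIZE + col + 1
```

A translator process along Denis's plan would sit between a GTP server
such as GoGui and the engine, reading numbers from the engine, emitting
well-formed GTP responses, and converting incoming play commands back.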
> +++++++++++++++++++++
> Response:
> +++++++++++++++++++++
> In fact... no, I don't know :)
> I have made up some numbers, and it seems that it would indeed take a
> few hours at best (for 2000 games).

2000 is not enough. It takes about 100,000 games to get within 1 or 2
Elo, or something like that. For programs that are sure to be very close
in strength even if they are not implemented the same way, we might need
that many games to detect a consistent bias with serious confidence.
This has a good chance of detecting flaws in the RNG too, if you play
the same program against itself, given that we know the expected score.

- Don

> ----------------------------------------------
> Don Said
> ----------------------------------------------
> It would take tens or hundreds of thousands of games to be able to
> say with high confidence that two programs are functionally identical
> in strength. So I envision a primary test that runs relatively
> quickly, and a more comprehensive test based on game play for the
> most interesting programs, or for anyone willing to take it that far.
>
> +++++++++++++++++++++
> Response:
> +++++++++++++++++++++
> I think 2000 games would be more than enough to prove a near-50% win
> ratio.
>
> The position-scoring test can be fast. I suspect that it is also
> nearly enough to do only that. We can design it so it takes about 10
> minutes per test suite. I think it is enough to generate a few
> positions (how many exactly?) and then ask both the reference bot and
> the to-be-tested bot to score each legal move - with a number of
> simulations high enough to get some reproducibility. Then we can get
> a confidence bound on how much they are alike. As an added bonus, if
> the number of simulations is high enough, the server can also time
> the speed, without the network and communication latency impacting
> things too much
> (which may be of limited value, as the hardware wouldn't be
> comparable, but well).
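As a rough check on the "about 100,000 games" figure above: under the
usual Elo logistic model, the win-rate shift produced by a small Elo
difference is tiny, so the number of independent games needed to resolve
it grows with the inverse square of that shift. A back-of-the-envelope
sketch; the function name and the two-sigma default are illustrative
choices, not part of any existing tool.

```python
import math

def games_needed(elo_diff, sigmas=2.0):
    """Approximate number of independent games before a win-rate edge
    corresponding to `elo_diff` Elo points exceeds `sigmas` standard
    errors, assuming a near-50% score and no draws."""
    # Elo logistic model: expected score of a player rated +d.
    p = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    delta = p - 0.5          # shift away from an even score
    # Standard error of the measured win rate near 0.5 is ~0.5/sqrt(n);
    # require sigmas * (0.5 / sqrt(n)) <= delta.
    return math.ceil((sigmas * 0.5 / delta) ** 2)
```

This gives roughly 120,000 games to resolve a 2 Elo difference at two
sigma, in line with Don's estimate, while 2,000 games can only resolve a
difference on the order of 15 Elo at the same confidence.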
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/
