On 24.07.2010 00:24, Don Dailey wrote:
On Fri, Jul 23, 2010 at 3:54 PM, Raymond Wold
<[email protected] <mailto:[email protected]>> wrote:
On 23.07.2010 02:12, Stefan Kaitschick wrote:
Why should the worst case be the most interesting?
In a program of this complexity worst case isn't the "true"
strength of the program.
Worst case is basically a bug.
Given enough time, even an "AI" that chooses entirely randomly
from the legal moves available will get a win against the best human.
I'm trying to follow this conversation, but I don't get your point on
this.
It was a point about ignoring poor performance and only looking at the
best results, which can justify pretty much anything as long as there is
some variation in the performance.
What's wrong with looking at average play to judge the program?
Well, it depends on how you measure the average. The typical way,
putting your bot on a go server and letting it play self-selected
humans, is not very good: surprise at its playing style, no
knowledge of the program's fundamental flaws, and so on (not to
mention humans not necessarily taking a game against a bot
seriously) will bias the result in the program's favour.
Playing strength is a function of everything you know and don't know
and all your strengths and weaknesses and how you put them together.
Every player, even the very best has weaknesses but those do not
determine how strong he is.
These two sentences contradict each other, to me. If a program plays
like a 9dan professional as long as there are never any ladders, but
has a bug that flips the status of every ladder, then that weakness
/does/ determine how strong it is. As incredible as it would be to make
such a program, it would not deserve to be classed with the human 9dan
professionals; it would probably be more correct to consider it double
digit kyu, since anyone who knows about ladders could set one up, or
several, and practically seal any game once they know of the flaw. It's
entirely conceivable that such a fault could go undetected for at least
a few games, or that the program could get wins against professionals
who didn't know about it.
I think your cognitive mental model of how this works is broken.
I didn't mention any cognitive mental model. I am speaking of what I
consider fair rank for computer programs.
I used to believe that a chain is as strong as its weakest link -
using this analogy to improve my tennis game. But a pro I was
friends with taught me a pretty valuable lesson. He told me that my
weaknesses do not have to define my game and he taught me how to be
aggressive with my strengths and thus minimize the weaknesses. He
gave me several examples of top pros that were far from the best at
certain things, but played in such a way that it was only a minor
issue and they reached the very top ranks. Then I remembered that
a chess master basically told me the same thing.
Well, it is fairly hard to make a computer program recognize and try to
minimize its weaknesses. You, as the programmer, can recognize it and
try to mitigate it, and that will probably increase its rank even under
my terms (and would please me no end), provided you truly minimize the
effect of the weaknesses, rather than just delaying the time it takes
for a human playing a long series of games to discover and exploit
them.
I would not object to an average of, say, 100 games against one
human opponent trying his best to win. With an even result under
such a series I would certainly consider the program as strong as
the human.
Within 2 or 3 games the human has learned 80% of what he needs to know.
Since most players who will play your program have already played
many others that have the same basic characteristics, my conclusion
is that computer strength as measured by ELO in KGS games is accurate
because they have already taken the hit for this particular weakness.
With "simple" bugs like not reading ladders, or flaws in life-and-death
analysis or patterns, I would agree. I am not, however, convinced that
today's go programs are free of more subtle weaknesses in playing
style, ones that will take more than a few games to uncover.
Your idea that there is some kind of break even point around game 100
is completely ludicrous.
I am not sure what you are referring to here. My point about an even
result after a hundred games was about two players playing a hundred
games (or a thousand; the number isn't the point) against each other,
with each player winning around half. That counts as a more trustworthy
measure of the two being equal in strength when one of the players is a
program.
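As a sketch of why the sample size matters here (using the standard
logistic Elo model and a normal approximation; the numbers are
illustrative, not results from actual games):

```python
import math

def elo_diff(wins, games):
    """Elo difference implied by a win rate under the logistic model."""
    p = wins / games
    return -400 * math.log10(1 / p - 1)

def margin(wins, games):
    """Rough 95% half-width on the observed win rate (normal approx.)."""
    p = wins / games
    return 1.96 * math.sqrt(p * (1 - p) / games)

# A 3-2 result and a 52-48 result both look roughly "even", but the
# uncertainty on the win rate differs by a factor of about four:
print(margin(3, 5))       # ~0.43 -- five games say almost nothing
print(margin(52, 100))    # ~0.10
print(elo_diff(52, 100))  # ~14 Elo points, well inside that noise
```

Five games cannot distinguish equal strength from a large gap; a
hundred games pin the difference down to within a couple of stones'
worth of Elo.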
Basically your idea of fair is that the first few games shouldn't
count - you just said it differently and it's a ridiculous idea.
That is indeed what I am saying, and I don't think it is so ridiculous.
I cannot tell you how often I have played some opponent that I could
not beat the first few times until I learned how to play him (in chess)
and it has also happened just the opposite for me where I seemed to
win easily but it was clear that my opponent was studying and
learning. Your comments suggest the first few games are invalid in
ANY encounter.
A human can much more easily detect and correct his flaws, especially
when they are being exploited. Thus the effect of weaknesses isn't as
important for humans.
And in terms of "interesting" I must say that I find the
program's best play much more interesting than its worst play.
By best play I don't mean some book play of course, but a
fine solution to a tricky problem.
"Tricky problems" is what a computer does best, a localized search
for a solution, possibly even brute forced. This isn't very
impressive to me.
You are clearly anti-computer and your comments seem to reflect a kind
of emotional prejudice instead of logic - for instance thinking that
it's fair to set up matches in such a way that the computer's
weaknesses are exaggerated even more. The computer's inability to
learn is already a handicap right from the first move.
If I am anti-anything, it would be against bias in program authors and
testers. I am for intellectual honesty. If you think a program should be
compensated in its ranking for the handicap that it can't learn, you
should give the program a much higher ranking than it would get on a go
server. After all, a 1dan amateur human player may one day qualify as a
professional, or even win a professional title, thanks to his ability to
learn.
For what it's worth, I would want my own program to be tested to these
standards of mine as well, and I would play it many games trying to find
and exploit flaws, and seek other players willing to test it in the same
way, regardless of what impact it would have on anyone's thoughts on its
rank. Not to put down the work I'd put into it, but so that I could
improve it.
Granted, truly awesome play is currently mostly to be seen on 9x9.
But I've seen some great kills on the big board that any top
amateur could be proud of.
And how do you deal with confirmation bias? If you look for
exceptional results, do you also look for spectacular failures?
What about if a program gets an occasional brilliant win, but
still loses most of the games?
There is a system that averages the two, it's called the ELO system,
or if you prefer the Go ranking system. The spectacular failures
will be reflected in the numbers.
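Don's point that the rating system averages brilliancies and failures
alike can be made concrete. A minimal sketch of the standard Elo update
(logistic expected score; the K=32 factor and the ratings are assumed
for illustration):

```python
def expected(r_a, r_b):
    """Expected score of player A against B under the logistic Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(rating, opponent, score, k=32):
    """Standard Elo update after one game (score: 1 = win, 0 = loss)."""
    return rating + k * (score - expected(rating, opponent))

# One brilliant upset against a 2200 player, then three routine losses
# to 1800 peers: the upset is rewarded sharply, but the failures are
# not ignored, and the net rating ends up below where it started.
r = 1800.0
r = update(r, 2200, 1)
for _ in range(3):
    r = update(r, 1800, 0)
print(r)  # ends a little below 1800
```

So a program that collects occasional brilliant wins while losing most
of its games cannot hide that in its rating.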
You comment on this a bit out of context, let me try to get it back on
track; does the ELO system, or any other ranking system, give any
program a rank on the big board that any top amateur could be proud of?
I was merely commenting on the apparent confirmation bias.
You remind me of the man who ties someones hands behind his back and
then fights him. When the handicapped man bites you, you complain
that he is not fighting fair.
If you set up the match right, you can give either player an advantage
but I for one would seriously not trust the results of such
manipulation. I think you need some kind of
reasonable justification for setting up match conditions such that one
player has every possible advantage.
So I have a suggestion. Let's play the match at the rate of game in
1 minute. I hold the human to higher standards and it doesn't seem
fair to me to set time controls in such a way that the human has such
an easy game of it. That's just not fair and it inflates the ELO of
the human player. Your counter argument had better not be that this is
the accepted time control for human players - of course it is.
I am not suggesting these unfair tests because I want the human to win,
at least not directly. I wish you would give me enough credit to look
deeper than that.
What I want is to be convinced that there /isn't/ any bias in the
program's favour. That playing a hundred games /will/ give the same
result as playing five. I want a program to meet this standard of mine.
Either you think that some of today's programs can do this, in which
case my suggestion isn't unfair at all! Or you think all programs do
have flaws that can be learned and exploited by humans, and you merely
have a different standard than me for what counts as a good go player.
This is perfectly OK. You don't have to be offended that we disagree on
our standards. If you still are, you can write it off as me just
suffering from perfectionism (which I admittedly do). Tradition has
worked out many ranking systems that work just fine and are counted as
very fair, for humans, and I do not think it unreasonable that a
computer program should only be measured under those systems. I simply
have my own standard, which takes into account at least one additional
factor that may come into play with programs that don't learn the way
humans do.
I want programs to improve under /my/ standard. I am kind of hoping
other go coders feel the same. Isn't handling the hard problem of
playing go what this is all about, not just getting a high rating
for kudos and commercial gain? Surely tackling what your program is
worst at serves that goal best?
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go