Re: [Computer-go] mini-max with Policy and Value network

Gian-Carlo Pascutto Tue, 23 May 2017 13:29:58 -0700

On 23-05-17 17:19, Hideki Kato wrote:
> Gian-Carlo Pascutto: <0357614a-98b8-6949-723e-e1a849c75...@sjeng.org>:
> 
>> Now, even the original AlphaGo played moves that surprised human pros
>> and were contrary to established sequences. So where did those come
>> from? Enough computation power to overcome the low probability?
>> Synthesized by inference from the (much larger than mine) policy network?
> 
> Demis Hassabis said in a talk:
> After the game with Sedol, the team used "adversarial learning" in 
> order to fill the holes in policy net (such as the Sedol's winning 
> move in the game 4).


I said the "original AlphaGo", i.e. the one used in the match against
Lee Sedol. According to the Nature paper, its policy net was trained
with supervised learning only [1]. And yet...

In the attached SGF, AlphaGo played P10, which was considered a very
surprising move by all commentators. Presumably, this means it's not
seen in high level human play, and would not get a high rating in the
policy net. I can sort-of confirm this:

0.295057654 (E13)
...(60 more moves follow)...
0.000011952 (P10)

So, 0.001% probability. Demis commented that Lee Sedol's winning move in
game 4 was a one in 10 000 move. This is a 1 in 100 000 move.
Differently trained policy nets might rate it a bit higher or lower, but
simply because the move was considered very un-human, it seems unlikely
to ever be rated highly by a policy net based on supervised learning.

So in AlphaGo's formula, you're dealing with a reduction of the UCT
(exploration) term by a factor of 100 000, give or take an order of
magnitude.
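
To make the size of that reduction concrete, here is a minimal sketch. It
assumes the PUCT-style exploration bonus described in the Nature paper,
u(s,a) = c_puct * P(s,a) * sqrt(N_parent) / (1 + N(s,a)). The function
name, the c_puct value and the visit counts are illustrative assumptions,
not anything taken from AlphaGo's or my engine's actual code. Only the two
priors come from the policy net output quoted above.

```python
import math

# Priors quoted above from the policy net output.
P_E13 = 0.295057654   # top-rated move
P_P10 = 0.000011952   # the move AlphaGo actually played


def puct_bonus(prior, parent_visits, child_visits, c_puct=5.0):
    """Exploration bonus u(s,a) = c_puct * P * sqrt(N_parent) / (1 + N_child).

    c_puct=5.0 is an illustrative assumption, not a tuned value.
    """
    return c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


# How rare is P10 according to the prior, and relative to the top move?
one_in = 1.0 / P_P10            # roughly 1 in 84 000
prior_ratio = P_E13 / P_P10     # roughly 25 000x less likely than E13

# With equal visit counts the bonus scales linearly with the prior, so
# P10's exploration term is smaller by exactly that ratio.
# (Hypothetical counts: 1 000 000 parent visits, 0 child visits.)
bonus_e13 = puct_bonus(P_E13, parent_visits=1_000_000, child_visits=0)
bonus_p10 = puct_bonus(P_P10, parent_visits=1_000_000, child_visits=0)

print(f"P10 prior is about 1 in {one_in:,.0f}")
print(f"E13 bonus / P10 bonus = {bonus_e13 / bonus_p10:,.0f}")
```

Because the bonus is linear in the prior, the only ways for a move like P10
to accumulate playouts are a very large parent visit count or a value
estimate that clearly beats the alternatives.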

  D6 -> 1359934 (W: 53.21%) (U: 49.34%) (V: 55.15%:  38918) (N:  6.3%)
PV: D6 F6 E7 F7 C8 B8 D7 B7 E9 C9 F8 H7 H9 K7 H3 K9
...many moves...
 P10 ->     421 (W: 52.68%) (U: 50.09%) (V: 53.98%:      8) (N:  0.0%)
PV: P10 Q10 P8 Q9

Now, of course AlphaGo had a few orders of magnitude more hardware, but
you can see from the above that it's, eh, not easy for P10 to overtake
the top moves here in playout count.

And yet, that's the move that was played.

[1] I'm assuming that what played the match corresponds to what they
published there - maybe that is my mistake. I'm not sure I remember the
relevant timeline correctly.

-- 
GCP

sedol.sgf
Description: application/go-sgf

_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
