Re: [Computer-go] mini-max with Policy and Value network

Hideki Kato Wed, 07 Jun 2017 09:18:12 -0700

A policy net that generalizes from shoulder-hit moves on lower lines 
may come to prefer the move in question.


Hideki

Gian-Carlo Pascutto: <df55c9d4-2f0a-d902-af71-7677497fc...@sjeng.org>:
>On 23-05-17 17:19, Hideki Kato wrote:
>> Gian-Carlo Pascutto: <0357614a-98b8-6949-723e-e1a849c75...@sjeng.org>:
>> 
>>> Now, even the original AlphaGo played moves that surprised human pros
>>> and were contrary to established sequences. So where did those come
>>> from? Enough computation power to overcome the low probability?
>>> Synthesized by inference from the (much larger than mine) policy network?
>> 
>> Demis Hassabis said in a talk:
>> After the match with Lee Sedol, the team used "adversarial learning" in 
>> order to fill the holes in the policy net (such as Lee Sedol's winning 
>> move in game 4).
>
>I said, the "original AlphaGo", i.e. the one used in the match against
>Lee Sedol. According to the Nature paper, the policy net was trained
>with supervised learning only [1]. And yet...
>
>In the attached SGF, AlphaGo played P10, which was considered a very
>surprising move by all commentators. Presumably, this means it's not
>seen in high level human play, and would not get a high rating in the
>policy net. I can sort-of confirm this:
>
>0.295057654 (E13)
>...(60 more moves follow)...
>0.000011952 (P10)
>
>So, 0.001% probability. Demis commented that Lee Sedol's winning move in
>game 4 was a one in 10 000 move. This is a 1 in 100 000 move.
>Differently trained policy nets might rate it a bit higher or lower, but
>simply due to the fact that it was considered very un-human to do, it seems
>unlikely to ever be rated highly by a policy net based on supervised
>learning.
>
>So in AlphaGo's formula, you're dealing with a reduction of the UCT term
>by a factor 100 000 plus or minus some order of magnitude.
>
>  D6 -> 1359934 (W: 53.21%) (U: 49.34%) (V: 55.15%:  38918) (N:  6.3%)
>PV: D6 F6 E7 F7 C8 B8 D7 B7 E9 C9 F8 H7 H9 K7 H3 K9
>...many moves...
> P10 ->     421 (W: 52.68%) (U: 50.09%) (V: 53.98%:      8) (N:  0.0%)
>PV: P10 Q10 P8 Q9
>
>Now, of course AlphaGo had a few orders of magnitude more hardware, but
>you can see from the above that it's, eh, not easy for P10 to overtake
>the top moves here in playout count.
>
>And yet, that's the move that was played.
>
>[1] I'm assuming that what played the match corresponds to what they
>published there - maybe that is my mistake. I'm not sure I remember the
>relevant timeline correctly.
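
To make the quoted point about the UCT term concrete, here is a minimal sketch. It assumes the exploration bonus has the PUCT form given in the Nature paper, u = c_puct * P * sqrt(N_parent) / (1 + N_child). The function name puct_bonus, the constant c_puct = 5.0 and the visit counts are illustrative assumptions, not values taken from the engine that produced the output above. Only the two priors come from the quoted message.

```python
import math

def puct_bonus(prior, parent_visits, child_visits, c_puct=5.0):
    """Exploration bonus u = c_puct * P * sqrt(N_parent) / (1 + N_child).

    Assumed form (PUCT); c_puct=5.0 is an illustrative choice.
    """
    return c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Policy-net priors quoted in the message above.
prior_e13 = 0.295057654
prior_p10 = 0.000011952

# At equal visit counts the bonus ratio reduces to the ratio of the priors,
# so c_puct and the parent visit count cancel out.
parent_visits = 10_000  # hypothetical
ratio = (puct_bonus(prior_e13, parent_visits, 0)
         / puct_bonus(prior_p10, parent_visits, 0))
print(f"E13 bonus / P10 bonus: {ratio:.0f}")

# Relative to a move with prior 1.0, P10's bonus is scaled down by 1 / prior.
print(f"P10 reduction factor vs. prior 1.0: {1 / prior_p10:.0f}")

# Bonus of a still-unvisited P10 as the parent accumulates visits.
for n in (10**4, 10**6, 10**8):
    print(f"parent visits {n:>11,d}: bonus {puct_bonus(prior_p10, n, 0):.6f}")
```

Under these assumptions the bonus for P10 grows only with the square root of the parent's visit count, so the move has to earn its playouts mostly through its evaluated win rate rather than through the exploration term.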
-- 
Hideki Kato <mailto:hideki_ka...@ybb.ne.jp>
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
