I have got a lockless hash table to work, and I thought I'd share the
A lockless hash table is very important, because the usual approach that
consists in using a global lock during tree search and update does not
scale well, especially on 9x9. But it is possible to create a completely
lockless hash table data structure that works with multiple threads.
Here are some links that give some indications of how such a thing can
Some of those links may look intimidating, but that is because the
resize part of the algorithm is complicated. In my implementation, I do
not resize the table, so it is very simple. Also, I update counter in
each node with atomic increments and decrements (no need to lock).
Here is some preliminary experimental data for 9x9 up to 8 cores,
running 840,000 playouts, from a tactical middle-game position:
(Cores / Playouts per second with spinlock / Playouts per second with
lockless hash table)
1 22,477.9 22,447.9
2 37,769.8 43,076.9
3 55,888.2 60,825.5
4 61,448.4 79,470.2
5 64,665.1 95,346.2
6 62,407.1 110,092.0
7 66,508.3 130,638.0
8 59,196.6 146,341.0
BTW, using a pthread mutex is a lot worse than a spinlock, in my
experience. I used the fair spinlock from the Linux kernel. But any
implementation should work.
So, it is pretty cool. This was measured on only one run. Since it is
not deterministic, performance may vary from one run to the other
(especially since it does not always select the same best move). But it
still clearly shows the superiority of the lockless hash table, and
seems to indicate that it would still scale beyond 8 cores.
I believe I could improve further by reducing the number of atomic
operations. Also, thinking about how to reduce atomic operations led me
to imagine a scheme that may works as a distributed hash table over a
network of PCs.
A simple scheme that would work on a single PC would consist in storing
one counter per thread in the table. This way, it would not be necessary
to use atomic operations to increment counters, and the cache coherency
mechanism of the CPUs would handle sending data from core to core. The
cost would be that in order to get the node counters, it would be
necessary to add N values. Also, some information may arrive a little
late to some threads (but I believe it is better to go run a playout
rather than wait for data).
This scheme is a bit equivalent to using a separate hash table for each
thread, and could be generalized to a distributed hash table over a
network: each PC would have its own hash table, and each node of the
tree would contain two counters: my_counter and other_counter. Every now
and then, for instance when my_counter reaches a threshold, this PC
would broadcast my_counter to the whole network, so that everybody
I have not implemented this yet, but I will probably try it.
Right now, I will test the lockless hash table more, and will probably
connect to 9x9 CGOS with that machine sometime during the week-end.
If you wish to implement your own lockless hash table, you should read
Intel's documentation about its memory architecture. It can be found there:
In particular, it is important to read "Architecture Memory Ordering
White Paper", and about the lock prefix, the cmpxchg operation, sfence,
lfence, and mfence.
The primitives I used in my algorithm are a store fence, atomic
increment, atomic decrement, atomic compare and swap. If you understand
what they do, you should be able to make your own lockless hash table.
computer-go mailing list