I have got a lockless hash table to work, and I thought I'd share the results.

A lockless hash table is very important, because the usual approach that consists in using a global lock during tree search and update does not scale well, especially on 9x9. But it is possible to create a completely lockless hash table data structure that works with multiple threads.

Here are some links that give some indications of how such a thing can be done:

Some of those links may look intimidating, but that is because the resize part of the algorithm is complicated. In my implementation, I do not resize the table, so it is very simple. Also, I update counter in each node with atomic increments and decrements (no need to lock).

Here is some preliminary experimental data for 9x9 up to 8 cores, running 840,000 playouts, from a tactical middle-game position:

(Cores / Playouts per second with spinlock / Playouts per second with lockless hash table)
1  22,477.9  22,447.9
2  37,769.8  43,076.9
3  55,888.2  60,825.5
4  61,448.4  79,470.2
5  64,665.1  95,346.2
6  62,407.1 110,092.0
7  66,508.3 130,638.0
8  59,196.6 146,341.0

BTW, using a pthread mutex is a lot worse than a spinlock, in my experience. I used the fair spinlock from the Linux kernel. But any implementation should work.

So, it is pretty cool. This was measured on only one run. Since it is not deterministic, performance may vary from one run to the other (especially since it does not always select the same best move). But it still clearly shows the superiority of the lockless hash table, and seems to indicate that it would still scale beyond 8 cores.

I believe I could improve further by reducing the number of atomic operations. Also, thinking about how to reduce atomic operations led me to imagine a scheme that may works as a distributed hash table over a network of PCs.

A simple scheme that would work on a single PC would consist in storing one counter per thread in the table. This way, it would not be necessary to use atomic operations to increment counters, and the cache coherency mechanism of the CPUs would handle sending data from core to core. The cost would be that in order to get the node counters, it would be necessary to add N values. Also, some information may arrive a little late to some threads (but I believe it is better to go run a playout rather than wait for data).

This scheme is a bit equivalent to using a separate hash table for each thread, and could be generalized to a distributed hash table over a network: each PC would have its own hash table, and each node of the tree would contain two counters: my_counter and other_counter. Every now and then, for instance when my_counter reaches a threshold, this PC would broadcast my_counter to the whole network, so that everybody updates other_counter.

I have not implemented this yet, but I will probably try it.

Right now, I will test the lockless hash table more, and will probably connect to 9x9 CGOS with that machine sometime during the week-end.

If you wish to implement your own lockless hash table, you should read Intel's documentation about its memory architecture. It can be found there:
In particular, it is important to read "Architecture Memory Ordering White Paper", and about the lock prefix, the cmpxchg operation, sfence, lfence, and mfence.

The primitives I used in my algorithm are a store fence, atomic increment, atomic decrement, atomic compare and swap. If you understand what they do, you should be able to make your own lockless hash table.

Have fun,

computer-go mailing list

Reply via email to