I did quite a bit of testing earlier this year on running playout
algorithms on GPUs. Unfortunately, I am too busy to write up a tech
report on it, but I finally brought myself to take the time to write
this e-mail at least. See bottom for conclusions.
For performance testing, I used my CPU board representation, and a CUDA
port of the same (with adjustments), to test the following algorithm
(sketched in code after the list):
- Clear the board
- Fill the board according to a uniform random policy
- Avoid filling eyes, according to a simple neighbour check
- Avoid simple ko
- Count the score and determine the winner
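For concreteness, one playout looks roughly like this. It is only a
sketch: Board, BLACK, PASS, xorshift(), is_eye_like(), play_move(),
opposite() and tromp_taylor_score() stand in for my board code rather
than being its real names, picking the first acceptable point after a
random offset only approximates a uniform policy, and suicide handling
is omitted.

  /* Sketch of one light playout; helper names are illustrative only. */
  __device__ int run_playout(Board *b, unsigned int *rng)
  {
      clear_board(b);
      int colour = BLACK, passes = 0;

      while (passes < 2) {
          int move = PASS;
          int n = b->num_empty;
          int start = n ? (xorshift(rng) % n) : 0;
          for (int i = 0; i < n; ++i) {
              int p = b->empty[(start + i) % n];
              /* skip eye fills (simple neighbour check) and simple ko */
              if (!is_eye_like(b, p, colour) && p != b->ko_point) {
                  move = p;
                  break;
              }
          }
          if (move == PASS) {
              passes++;
          } else {
              passes = 0;
              play_move(b, move, colour);  /* captures, counts, ko point */
          }
          colour = opposite(colour);
      }
      return tromp_taylor_score(b);   /* > 0 means a black win */
  }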
In other words: no tree search is involved, and this is the lightest
possible playout. The raw numbers are as follows:
- CPU Search: 47,000 playouts per CPU core per second, on an Intel
Core 2 Duo 6600
- GPU Search: 170,000 playouts per second, on an NVIDIA GeForce GTX 285 card
The algorithm running on the GPU is a straight port, with several
optimisations then made to severely restrict memory access. It is
therefore a "naive" sort of parallel algorithm: parallel at the
per-board level, like the CPU implementation, rather than
per-intersection or some other highly parallel scheme.
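In code, the per-board scheme is simply the CPU loop wrapped in a
kernel, one playout sequence per thread; the names (run_playout, Board,
seeds, wins) are the same illustrative ones as in the sketch above.

  /* One thread == one board, exactly as on the CPU. */
  __global__ void playout_kernel(const unsigned int *seeds, int *wins,
                                 int playouts_per_thread)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned int rng = seeds[tid];
      Board board;                 /* packed into shared memory, see below */
      int black_wins = 0;

      for (int i = 0; i < playouts_per_thread; ++i)
          if (run_playout(&board, &rng) > 0)
              black_wins++;

      wins[tid] = black_wins;      /* the host sums the per-thread totals */
  }

  /* Host side, with the configuration that worked best (see below):
         playout_kernel<<<256, 80>>>(d_seeds, d_wins, n);               */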
Memory access other than to on-chip shared memory carries a severe
penalty on the GPU, so all threads running on a multiprocessor at any
one time have to make do with a fast shared memory of 16,384 bytes. So
(a code sketch follows the list):
- The board was compressed into a bit board, using 2*21 unsigned ints
per thread
- The count of empty, white and black intersections and the ko
position was also in shared memory per thread
- String/group/block type information was in global memory, as there
was no way to store it in 16384 bytes
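Concretely, at the 80 threads per block that turned out best (see
below), the shared declarations looked along these lines. All names
here are made up, the int-sized counters are my assumption, and the
guess that the two bit planes hold one colour each is mine.

  #define THREADS_PER_BLOCK 80

  struct PerThreadState {
      int empty, black, white;   /* intersection counts */
      int ko;                    /* simple-ko point, or -1 */
  };

  __global__ void playout_kernel(/* ... */)
  {
      /* two planes of 21 words each (e.g. one per colour):
         80 * 2 * 21 * 4 bytes = 13,440 bytes of bit boards */
      __shared__ unsigned int s_bits[THREADS_PER_BLOCK][2][21];
      /* 80 * 16 bytes = 1,280 bytes of counts and ko: ~14.7 KB in all,
         which is most of the 16,384 bytes available */
      __shared__ PerThreadState s_state[THREADS_PER_BLOCK];

      unsigned int (*my_bits)[21] = s_bits[threadIdx.x];
      PerThreadState *my_state = &s_state[threadIdx.x];
      /* ... playouts touch only my_bits / my_state; string data
         stays in global memory ... */
  }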
Optimal speed was at 80 threads per block, with 256 blocks. The card had
only 9% processor occupancy, due to the shared memory being almost
exhausted. However, branch divergence was at only 2%, which is not bad
at all, suggesting that divergence is not what blocks this form of
parallelism.
This may be because the "usual" cases, a point being illegal to play or
a simple play with no need to merge or remove strings, are by far the
most common.
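For anyone who wants to check the occupancy figure: assuming the GT200
limits of 16 KB of shared memory and 32 resident warps (1,024 threads)
per multiprocessor, and the shared-memory layout guessed at above, only
one block fits per multiprocessor, which lands close to the 9% quoted.

  /* Back-of-the-envelope occupancy (GT200 limits assumed). */
  int shared_per_block = 80 * (2 * 21 + 4) * 4;    /* 14,720 bytes        */
  int blocks_per_sm    = 16384 / shared_per_block; /* = 1                 */
  int warps_per_block  = (80 + 31) / 32;           /* = 3 (one half-full) */
  /* occupancy = blocks_per_sm * warps_per_block / 32 = 3/32, about 9% */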
Conclusions:
I see these results as broadly negative with the current generation of
technology. Per-board parallelism on a GPU is not worth it compared to
the CPU speed and the severe drawbacks of working on a GPU (testing is
hard, unfamiliar environment for most programmers, lots of time to spend
on optimisation, etc).
The problems would be severely compounded by trying to integrate any
tree search, or heavy playouts. Trees are almost impossible to construct
on a GPU because pointers cannot be transferred from the host to the
GPU. They could still be represented using arrays, but the random nature
of tree access would cause huge penalties as it would prevent coalesced
memory access.
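For what it's worth, the array representation would have to look
something like the following (illustrative only), with child links
stored as indices into one flat node buffer so the same data is
meaningful on host and device; the trouble is that walking the tree
chases those indices in an essentially random order, which defeats
coalescing.

  #define NO_NODE (-1)

  /* Pointer-free tree node: links are indices into a flat array that
     can be copied to the device as-is. */
  struct TreeNode {
      int   move;          /* move leading to this node              */
      int   first_child;   /* index into the node array, or NO_NODE  */
      int   next_sibling;  /* index into the node array, or NO_NODE  */
      int   visits;
      float wins;
  };

  /* nodes[0] is the root; descending means hopping from index to
     index, so neighbouring threads read scattered, uncoalesced
     addresses. */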
Highly parallel algorithms (e.g. one thread per intersection) can still
be investigated, but my (unproven!) intuition is that they are not worth
it, as most intersections will be idle on any given move, wasting
processor time and occupancy.
My feeling is that GPUs may have some potential in this area, but
possibly in a supplementary role such as running additional pattern
matching in the background, or driving machine learning components.
This e-mail is a bit hurried, so.. questions are welcome!!
Christian