I did quite a bit of testing earlier this year on running playout
algorithms on GPUs. Unfortunately, I am too busy to write up a tech
report on it, but I finally brought myself to take the time to write
this e-mail at least. See bottom for conclusions.
For performance testing, I used my CPU board representation, and a CUDA
port of the same (with adjustments), to test the following algorithm
(sketched in code after the list):
- Clear the board
- Fill the board according to a uniform random policy
- Avoid filling eyes, according to a simple neighbour check
- Avoid simple ko
- Count the score and determine the winner
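For concreteness, one playout looks roughly like this. It is only a
sketch: Board, BLACK, PASS, xorshift(), is_eye_like(), play_move(),
opposite() and tromp_taylor_score() stand in for my board code rather
than being its real names, picking the first acceptable point after a
random offset only approximates a uniform policy, and suicide handling
is omitted.

  /* Sketch of one light playout; helper names are illustrative only. */
  __device__ int run_playout(Board *b, unsigned int *rng)
  {
      clear_board(b);
      int colour = BLACK, passes = 0;

      while (passes < 2) {
          int move = PASS;
          int n = b->num_empty;
          int start = n ? (xorshift(rng) % n) : 0;
          for (int i = 0; i < n; ++i) {
              int p = b->empty[(start + i) % n];
              /* skip eye fills (simple neighbour check) and simple ko */
              if (!is_eye_like(b, p, colour) && p != b->ko_point) {
                  move = p;
                  break;
              }
          }
          if (move == PASS) {
              passes++;
          } else {
              passes = 0;
              play_move(b, move, colour);  /* captures, counts, ko point */
          }
          colour = opposite(colour);
      }
      return tromp_taylor_score(b);   /* > 0 means a black win */
  }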
In other words: no tree search is involved, and this is the lightest
possible playout. The raw numbers are as follows:
- CPU Search: 47,000 playouts per CPU core per second, on an Intel
Core 2 Duo 6600
- GPU Search: 170,000 playouts per second, on an NVIDIA GeForce GTX 285 card
The algorithm running on the GPU is a straight port, with several
optimisations then made to severely restrict memory access. It is
therefore a "naive" sort of parallel algorithm: parallel at the
per-board level, like the CPU implementation, rather than
per-intersection or some other highly parallel scheme.
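In code, the per-board scheme is simply the CPU loop wrapped in a
kernel, one playout sequence per thread; the names (run_playout, Board,
seeds, wins) are the same illustrative ones as in the sketch above.

  /* One thread == one board, exactly as on the CPU. */
  __global__ void playout_kernel(const unsigned int *seeds, int *wins,
                                 int playouts_per_thread)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned int rng = seeds[tid];
      Board board;                 /* packed into shared memory, see below */
      int black_wins = 0;

      for (int i = 0; i < playouts_per_thread; ++i)
          if (run_playout(&board, &rng) > 0)
              black_wins++;

      wins[tid] = black_wins;      /* the host sums the per-thread totals */
  }

  /* Host side, with the configuration that worked best (see below):
         playout_kernel<<<256, 80>>>(d_seeds, d_wins, n);               */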
Memory access other than to on-chip shared memory carries a severe
penalty on the GPU, so all threads running on a multiprocessor at any
one time have to make do with a fast shared memory of 16,384 bytes. So
(a code sketch follows the list):
- The board was compressed into a bit board, using 2*21 unsigned ints
per thread
- The count of empty, white and black intersections and the ko
position was also in shared memory per thread
- String/group/block type information was in global memory, as there
was no way to store it in 16384 bytes
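Concretely, at the 80 threads per block that turned out best (see
below), the shared declarations looked along these lines. All names
here are made up, the int-sized counters are my assumption, and the
guess that the two bit planes hold one colour each is mine.

  #define THREADS_PER_BLOCK 80

  struct PerThreadState {
      int empty, black, white;   /* intersection counts */
      int ko;                    /* simple-ko point, or -1 */
  };

  __global__ void playout_kernel(/* ... */)
  {
      /* two planes of 21 words each (e.g. one per colour):
         80 * 2 * 21 * 4 bytes = 13,440 bytes of bit boards */
      __shared__ unsigned int s_bits[THREADS_PER_BLOCK][2][21];
      /* 80 * 16 bytes = 1,280 bytes of counts and ko: ~14.7 KB in all,
         which is most of the 16,384 bytes available */
      __shared__ PerThreadState s_state[THREADS_PER_BLOCK];

      unsigned int (*my_bits)[21] = s_bits[threadIdx.x];
      PerThreadState *my_state = &s_state[threadIdx.x];
      /* ... playouts touch only my_bits / my_state; string data
         stays in global memory ... */
  }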
Optimal speed was at 80 threads per block, with 256 blocks. The card had
only 9% processor occupancy, due to the shared memory being almost
exhausted. However, branch divergence was at only 2%, which is not bad
at all, suggesting that divergence is not what blocks this form of
parallelism.
This may be because the "usual" cases, a point being illegal to play or
a simple play with no need to merge or remove strings, are by far the
most common.
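For anyone who wants to check the occupancy figure: assuming the GT200
limits of 16 KB of shared memory and 32 resident warps (1,024 threads)
per multiprocessor, and the shared-memory layout guessed at above, only
one block fits per multiprocessor, which lands close to the 9% quoted.

  /* Back-of-the-envelope occupancy (GT200 limits assumed). */
  int shared_per_block = 80 * (2 * 21 + 4) * 4;    /* 14,720 bytes        */
  int blocks_per_sm    = 16384 / shared_per_block; /* = 1                 */
  int warps_per_block  = (80 + 31) / 32;           /* = 3 (one half-full) */
  /* occupancy = blocks_per_sm * warps_per_block / 32 = 3/32, about 9% */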
Conclusions:
I see these results as broadly negative with the current generation of
technology. Per-board parallelism on a GPU is not worth it compared to
the CPU speed and the severe drawbacks of working on a GPU (testing is
hard, unfamiliar environment for most programmers, lots of time to spend
on optimisation, etc).
The problems would be severely compounded by trying to integrate any
tree search, or heavy playouts. Trees are almost impossible to construct
on a GPU because pointers cannot be transferred from the host to the
GPU. They could still be represented using arrays, but the random nature
of tree access would cause huge penalties as it would prevent coalesced
memory access.
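For what it's worth, the array representation would have to look
something like the following (illustrative only), with child links
stored as indices into one flat node buffer so the same data is
meaningful on host and device; the trouble is that walking the tree
chases those indices in an essentially random order, which defeats
coalescing.

  #define NO_NODE (-1)

  /* Pointer-free tree node: links are indices into a flat array that
     can be copied to the device as-is. */
  struct TreeNode {
      int   move;          /* move leading to this node              */
      int   first_child;   /* index into the node array, or NO_NODE  */
      int   next_sibling;  /* index into the node array, or NO_NODE  */
      int   visits;
      float wins;
  };

  /* nodes[0] is the root; descending means hopping from index to
     index, so neighbouring threads read scattered, uncoalesced
     addresses. */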
Highly parallel algorithms (e.g. one thread per intersection) can still
be investigated, but my (unproven!) intuition is that they are not worth
it, as most intersections will be idle on any given move, wasting
processor time and occupancy.
My feeling is that GPUs may have some potential in this area, but
possibly in a supplementary role such as running additional pattern
matching in the background, or driving machine learning components.
This e-mail is a bit hurried, so.. questions are welcome!!
Christian