[computer-go] CUDA implementation of the per-intersection GPGPU approach

Petr Baudis Thu, 10 Sep 2009 04:01:12 -0700

  I have written a quick CUDA implementation of the per-intersection
GPGPU apprach; Christian's e-mail finally made me polish it up to
a somewhat presentable form.

  In this implementation, each GPU thread maintains a single
intersection.

  The implementation uses 9x9 board (10x11 internally); expanding to
19x19 would probably mean either moving some data to global memory
or splitting the playout of single board to multiple blocks or waiting
for GPUs with larger shared memory. ;-)

  The speed is unfortunately very disappointing; on GTX260, I see
17,300 playouts per second, with 50 playouts running in parallel.
There are several "but"s:

  - With some further obvious optimizations, I'm thinking it should be
    possible to get it to at least 22,000 playouts per second easily.

  - Bit boards are an obvious thing to try, but I'm quite doubtful
    they will be an improvement

  - The playouts are uniformly random now, but they have been
    implemented in a way to make heavier playouts easy; precise liberty
    count is maintained and it should be easy to add pattern matching;
    the move selection can deal with arbitrary probabilities for
    different intersections (I wonder, what are the standard
    high-performance numbers for heavier playouts with pattern matching?)

  - Christian is playing 2650 games in parallel; I'm playing only 50
    games in parallel, in the meantime the CPU can pick next games to
    play from the tree

  - My code already accounts for transferring board images from CPU to
    the GPU (different one for each playout) and getting scores back

  - Christian's GPU seems quite better than mine

  - Apparently I still suck at CUDA optimizations; this has been my
    first real small CUDA project, it would be awesome if someone more
    experienced could look at the code and suggest obvious improvements

  Still, I'm pretty unhappy with the slowness; I wonder how Christian
achieved such a high speed. One problem with my approach is that I have
to make use of a lot of atomic instrinsics (some of them could be worked
around) and __syncthreads() all the time to ensure that all threads have
consistent board image at each stage.

  I think with pattern matching, the per-intersection approach could
shine more, since I can easily match patterns everywhere on the board
at once. Still, I wonder if it will become at least on par with good
CPU implementations.

On Wed, Sep 09, 2009 at 04:54:23PM +0100, Christian Nentwich wrote:
> In other words: no tree search is involved, and this is the lightest
> possible playout. The raw numbers are as follows:
>  - CPU Search: 47,000 playouts per CPU core per second, on an Intel
> 6600 Core-2 Duo
>  - GPU Search: 170,000 playouts per second, on an NVidia Geforce 285 card

  I still find this quite astonishing; since you consider this
experiment a failure anyway, would you mind publishing the source code?
:-)

-- 
                                Petr "Pasky" Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth

// CUDA implementation of random go player
// (c) Petr Baudis <[email protected]>  2009
// MIT-style licence; please credit me if you make use of this
// thanks to ibd for some nice ideas


// This code tries to make use of the extra-high parallelism by giving
// each board intersection a single thread; all the __device__ functions
// are run in parallel for each intersection.

// nvcc -arch sm_13 -o board_move board_move.cu

// Still, many random playouts are played in parallel on the GPU as well.
// Each one is played independently and can have a different starting
// position. Good value on my GTX 260 is 50. Careful, if you specify this
// too high, the computation will get _extremely_ slow.

// So you'd call this like: ./board_move 42 100000 50


// FIXME: No ko detection
/* Actually, I don't think ko detection is so important; it will mud down
   playouts somewhat, but eventually all the moves to be made except one
   ko fight are made anyway and MAX_MOVES catches the last ko. */

#include <stdio.h>
#include <stdlib.h>
#include <sys/times.h>
#include <unistd.h>

#define RS 9
#define S (RS + 1) // line with delimiter from neighbors
#define S2 (S*(S+1)+1) // square with border on top, left, bottom (plus an 
extra for S_EDGE SW of SWest S_NONE)
#define MAX_MOVES (RS * RS * 2)

struct board {
#define S_NONE 0
#define S_BLACK 1
#define S_WHITE 2
#define S_EDGE 3
        int stone[S2];

        /* >0: coordinate of "group center"
         * 0: no stone there
         * -1: eye forbidden for black to play
         * -2: eye forbidden for white to play
         * -3: eye forbidden for both to play */
        int group[S2];
        int libs[S2];

        float p[S2]; /* probability of play; sum = 1 */

        int random;
        int free_spots[2]; /* free spots # for black and white */
        int to_play;
        int moves;
        int komi; /* <0 black wins, >0 white wins */
} b;


__device__ unsigned int
next_random(unsigned int seed)
{
        return seed * 16807;
}

__device__ float
float_random(unsigned int seed)
{
        /* Construct (1,2) IEEE float from our random integer */
        /* http://rgba.org/articles/sfrand/sfrand.htm */
        unsigned long p = (((seed + 2) & 0x007fffff) - 1) | 0x3f800000;
        return *((float*)&p) - 1.0f;
}


#define TL (threadIdx.x - 1)
#define TR (threadIdx.x + 1)
#define TU (threadIdx.x - S)
#define TD (threadIdx.x + S)

__device__ void
update_libs(struct board *bp, int delta, int except)
{
        int groups[4] = { bp->group[TU], bp->group[TL], bp->group[TR], 
bp->group[TD] };
        /* A loop over groups[] is somehow never unrolled and groups[] is 
forced to
         local memory */
        if (groups[0] != except)
                atomicAdd(&bp->libs[groups[0]], delta);
        if (groups[1] != except && groups[1] != groups[0])
                atomicAdd(&bp->libs[groups[1]], delta);
        if (groups[2] != except && groups[2] != groups[1] && groups[2] != 
groups[0])
                atomicAdd(&bp->libs[groups[2]], delta);
        if (groups[3] != except && groups[3] != groups[2] && groups[3] != 
groups[1] && groups[3] != groups[0])
                atomicAdd(&bp->libs[groups[3]], delta);
}

__device__ void
capture_stones(struct board *bp)
{
        bp->libs[threadIdx.x] = bp->libs[bp->group[threadIdx.x]];
        if (bp->libs[threadIdx.x] == 0) {
                update_libs(bp, +1, bp->group[threadIdx.x]);
                bp->stone[threadIdx.x] = S_NONE;
                bp->group[threadIdx.x] = 0;
                atomicAdd(&bp->free_spots[0], 1);
                atomicAdd(&bp->free_spots[1], 1);
        }
}

__device__ void
dprint_board(struct board *bp)
{
#if 0
        for (int i = 0; i < S+1; i++) {
                int j;
                for (j = 0; j < S; j++) {
                        int st = bp->stone[i * S + j];
                        printf("%c ", st == S_EDGE ? '#' : st == S_WHITE ? 'O' 
: st == S_BLACK ? 'X' : '.');
                }
                printf(" ");
                for (j = 0; j < S; j++)
                        printf("%03d ", bp->group[i * S + j]);
                printf(" ");
                for (j = 0; j < S; j++)
                        printf("%02d ", bp->libs[i * S + j]);
#if 0
                printf(" ");
                for (j = 0; j < S; j++)
                        printf("%1.02f ", bp->p[i * S + j]);
#endif
                printf("\n");
        }
        printf("random %d moves %d free_spots %d,%d to_play %d komi %d\n", 
bp->random, bp->moves, bp->free_spots[0], bp->free_spots[1], bp->to_play, 
bp->komi);
#endif
}

__device__ void
survey_eye(struct board &b)
{
        int nei[4] = {TU, TL, TR, TD};
        int dnei[4] = {TU-1, TU+1, TD-1, TD+1};
        int stonecount = 0; // each byte is one direction
        /* We have to manually do bit magic, nvcc is too stupid and
         would force stonecount[] to local memory */
#define STONECOUNT_ANY(stone) (stonecount & (0xf << ((stone) * 4)))
#define STONECOUNT(stone) ((stonecount & (0xf << ((stone) * 4))) >> ((stone) * 
4))
        for (int i = 0; i < 4; i++) {
                int s = b.stone[nei[i]];
                int sc = STONECOUNT(s) + 1;
                stonecount = stonecount & ~(0xf << (s * 4)) | (sc << (s * 4));
        }
        if (STONECOUNT_ANY(S_NONE) || (STONECOUNT_ANY(S_BLACK) && 
STONECOUNT_ANY(S_WHITE)))
                return;
        bool is_white = STONECOUNT_ANY(S_WHITE);
        bool on_edge = STONECOUNT_ANY(S_EDGE);

        /* False eyes aren't forbidden, however. */
        /* XXX: We don't support http://senseis.xmp.net/?TwoHeadedDragon */
        stonecount = 0;
        for (int i = 0; i < 4; i++) {
                int s = b.stone[dnei[i]];
                int sc = STONECOUNT(s) + 1;
                stonecount = stonecount & ~(0xf << (s * 4)) | (sc << (s * 4));
        }
        if (on_edge + STONECOUNT(is_white ? S_BLACK : S_WHITE) > 1) {
                /* This might've been forbidden eye in past - ponnuki */
                if (b.group[threadIdx.x] < 0) {
                        if ((-b.group[threadIdx.x]) & S_BLACK)
                                atomicAdd(&b.free_spots[0], 1);
                        if ((-b.group[threadIdx.x]) & S_WHITE)
                                atomicAdd(&b.free_spots[1], 1);
                        b.group[threadIdx.x] = 0;
                }
                return;
        }

        /* In case of the last liberty, the other player can play in the eye. */
        bool last_lib = (b.stone[TL] != S_EDGE ? b.libs[TL] : b.libs[TR]) < 2;
        switch (b.group[threadIdx.x]) {
                case 0:
                        /* Freshly appeared eye; remove it from the relevant 
pools */
                        if (!last_lib || !is_white)
                                atomicSub(&b.free_spots[0], 1);
                        if (!last_lib || is_white)
                                atomicSub(&b.free_spots[1], 1);
                        break;
                case -1:
                        /* Formerly half-forbidden eye; if not anymore, remove 
it from the other player's pool */
                        if (!last_lib)
                                atomicSub(&b.free_spots[1], 1);
                        break;
                case -2:
                        if (!last_lib)
                                atomicSub(&b.free_spots[0], 1);
                        break;
                case -3:
                        /* Formerly forbidden eye; possibly allow one player to 
play inside */
                        if (last_lib)
                                atomicAdd(&b.free_spots[1 - is_white], 1);
                        break;
        }
        b.group[threadIdx.x] = last_lib ? -1 - is_white : -3;
}

__device__ void
calc_probability(struct board &b)
{
        int group = b.group[threadIdx.x];
        b.p[threadIdx.x] = 1.F / b.free_spots[b.to_play - 1];
        if (b.stone[threadIdx.x] != S_NONE)
                b.p[threadIdx.x] = 0;
        if (group < 0 && ((-group) & b.to_play))
                b.p[threadIdx.x] = 0;
}

__device__ void
play_one_move(struct board &b)
{
        __shared__ int group_merge_n, group_merge[4];
        __shared__ int move;

        /** Get a random number */
        if (!threadIdx.x)
                b.random = next_random(b.random);

        /** Choose a move to play. */
        /* So-called weighted random selection; build a tree of probability
           bounds in O(logN), then check one node per thread. We don't need
           to bother with downsweep-reduce, our array is fairly tiny. */
        /* upbound[] is double-buffered */
        __shared__ float upbound[S2 * 2];
        int outo = 0, ino = 1;
        upbound[threadIdx.x] = b.p[threadIdx.x];
        __syncthreads();
        for (int d = 1; d < S2; d *= 2) {
                outo = 1 - outo; ino = 1 - ino;
                if (threadIdx.x >= d)
                        upbound[outo * S2 + threadIdx.x] = upbound[ino * S2 + 
threadIdx.x] + upbound[ino * S2 + threadIdx.x - d];
                else
                        upbound[outo * S2 + threadIdx.x] = upbound[ino * S2 + 
threadIdx.x];
                __syncthreads();
        }

        /** Place the stone */

        // p must not be 0 (never matched) nor 1 (may be matched many times on 
the edge)
        float p = float_random(b.random);
        int to_play;
        // printf("[%d] %f < %f < %f\n", threadIdx.x, threadIdx.x ? 
upbound[outo * S2 + threadIdx.x - 1] : -1.F, p, upbound[outo * S2 + 
threadIdx.x]);
        if (p <= upbound[outo * S2 + threadIdx.x] && (!threadIdx.x || 
upbound[outo * S2 + threadIdx.x - 1] < p)) {
                move = threadIdx.x;
                b.stone[move] = b.to_play;
                /* Take off liberty from surrounding groups */
                update_libs(&b, -1, 0);
                group_merge_n = 0;
        } else {
                to_play = b.to_play;
        }
        __syncthreads();

        /** Survey if the stone can join existing group */
        switch (move - (int)threadIdx.x) {
                case 0:
                        to_play = b.to_play == S_BLACK ? S_WHITE : S_BLACK;
                        // XXX: Two ifs are probably more efficient than full 
branch
                        if (b.group[threadIdx.x] < 0) {
                                /* Half-forbidden eye */
                                atomicSub(&b.free_spots[b.to_play - 1], 1);
                        } else {
                                atomicSub(&b.free_spots[0], 1);
                                atomicSub(&b.free_spots[1], 1);
                        }
                        b.to_play = to_play;
                        b.moves++;
                        // printf("Z %d\n", move);
                        break;
                case -S:
                case -1:
                case 1:
                case S:
                        if (b.stone[threadIdx.x] == to_play)
                                group_merge[atomicAdd(&group_merge_n, 1)] = 
b.group[threadIdx.x];
                        else if (b.stone[threadIdx.x] == S_NONE)
                                atomicAdd(&b.libs[move], 1);
                        break;
        }
        __syncthreads();

        /** Merge multiple groups if applicable */
        if (group_merge_n > 1) {
                if (threadIdx.x == move)
                        b.group[threadIdx.x] = group_merge[0];
                else if (threadIdx.x == group_merge[0])
                        b.libs[threadIdx.x] = 0;

                for (int i = 1; i < group_merge_n; i++)
                        if (b.group[threadIdx.x] == group_merge[i])
                                b.group[threadIdx.x] = group_merge[0];
                __syncthreads();

                /* Recalculate liberties */
                if (b.stone[threadIdx.x] == S_NONE
                    && (b.group[TU] == group_merge[0]
                        || b.group[TD] == group_merge[0]
                        || b.group[TL] == group_merge[0]
                        || b.group[TR] == group_merge[0]))
                        atomicAdd(&b.libs[group_merge[0]], 1);

        /** otherwise just join the group and survey bonus libs for the group */
        } else if (group_merge_n == 1) {
                switch (move - (int)threadIdx.x) {
                        case 0:
                                b.group[threadIdx.x] = group_merge[0];
                                break;
                        case -S:
                                if (b.stone[threadIdx.x] != S_NONE)
                                        break;
                                if (b.group[TD] != group_merge[0]
                                    && b.group[TL] != group_merge[0]
                                    && b.group[TR] != group_merge[0])
                                        atomicAdd(&b.libs[group_merge[0]], 1);
                                break;
                        case -1:
                                if (b.stone[threadIdx.x] != S_NONE)
                                        break;
                                if (b.group[TU] != group_merge[0]
                                    && b.group[TD] != group_merge[0]
                                    && b.group[TR] != group_merge[0])
                                        atomicAdd(&b.libs[group_merge[0]], 1);
                                break;
                        case 1:
                                if (b.stone[threadIdx.x] != S_NONE)
                                        break;
                                if (b.group[TU] != group_merge[0]
                                    && b.group[TD] != group_merge[0]
                                    && b.group[TL] != group_merge[0])
                                        atomicAdd(&b.libs[group_merge[0]], 1);
                                break;
                        case S:
                                if (b.stone[threadIdx.x] != S_NONE)
                                        break;
                                if (b.group[TU] != group_merge[0]
                                    && b.group[TL] != group_merge[0]
                                    && b.group[TR] != group_merge[0])
                                        atomicAdd(&b.libs[group_merge[0]], 1);
                                break;
                }

        /** or create a new group! */
        } else {
                if (threadIdx.x == move)
                        b.group[threadIdx.x] = threadIdx.x;
        }

        __syncthreads();

        /* Both following capture tests propagate liberties themselves */

        /** Take out opponent's stones */
        if (b.group[threadIdx.x] > 0 && b.stone[threadIdx.x] == b.to_play) {
                capture_stones(&b);
        }
        __syncthreads();

        /** Take out our stones */
        if (b.group[threadIdx.x] > 0 && b.stone[threadIdx.x] != b.to_play) {
                capture_stones(&b);
        }
        __syncthreads();

        /** Propagate liberties */
        if (b.group[threadIdx.x] > 0) {
                b.libs[threadIdx.x] = b.libs[b.group[threadIdx.x]];
        }
        __syncthreads();

        /** Check if we are an eye */
        if (b.stone[threadIdx.x] == S_NONE) {
                survey_eye(b);
        }
        __syncthreads();

        /** Update probabilities */
        if (b.free_spots[b.to_play - 1] > 0) {
                calc_probability(b);
                __syncthreads();
        }

        if (!threadIdx.x)
                dprint_board(&b);
}

__device__ void
board2board(struct board *b1, struct board *b2)
{
        /* First thread loads global state */
        if (!threadIdx.x) {
                b2->random = b1->random;
                b2->free_spots[0] = b1->free_spots[0];
                b2->free_spots[1] = b1->free_spots[1];
                b2->to_play = b1->to_play;
                b2->moves = b1->moves;
                b2->komi = b1->komi;
        }
        /* Then each thread loads one element */
        b2->stone[threadIdx.x] = b1->stone[threadIdx.x];
        b2->group[threadIdx.x] = b1->group[threadIdx.x];
        b2->libs[threadIdx.x] = b1->libs[threadIdx.x];
        b2->p[threadIdx.x] = b1->p[threadIdx.x];
}

__device__ void
count_score(struct board &b, int &score)
{
        /* XXX: This is horribly ineffective. */
        __shared__ int black, white;
        if (!threadIdx.x) {
                black = white = 0;
        }
        __syncthreads();
        switch (b.stone[threadIdx.x]) {
                case S_BLACK: atomicAdd(&black, 1); break;
                case S_WHITE: atomicAdd(&white, 1); break;
                case S_NONE:
                        if (b.stone[TL] == S_BLACK || b.stone[TR] == S_BLACK)
                                atomicAdd(&black, 1);
                        else
                                atomicAdd(&white, 1);
                        break;
        }
        __syncthreads();
        if (!threadIdx.x)
                score = b.komi + white - black;
}

__global__ void
player(struct board *gb, int *score)
{
        /** First, load board into shared memory */
        __shared__ struct board b;
        board2board(&gb[blockIdx.x], &b);

        __syncthreads();

        /** Play the game */

#if 0
        /* For device code debugging - run fixed number of iterations */
        for (int i = 0; i < 128; i++) {
#else
        while (b.moves < MAX_MOVES && b.free_spots[0] + b.free_spots[1] > 0) {
#endif
                if (b.free_spots[b.to_play - 1] > 0) {
                        play_one_move(b);
                } else {
                        /* pass and let the other player make a move */
                        if (!threadIdx.x) {
                                b.moves++;
                                b.to_play = b.to_play == S_BLACK ? S_WHITE : 
S_BLACK;
                        }
                        __syncthreads();
                        calc_probability(b);
                        __syncthreads();
                }
        }

        /** Count score */
        count_score(b, score[blockIdx.x]);

        /** Send board back */
        //board2board(&b, &gb[blockIdx.x]);
}


void
print_board(struct board *bp)
{
        for (int i = 0; i < S+1; i++) {
                int j;
                for (j = 0; j < S; j++) {
                        int st = bp->stone[i * S + j];
                        printf("%c ", st == S_EDGE ? '#' : st == S_WHITE ? 'O' 
: st == S_BLACK ? 'X' : '.');
                }
                printf(" ");
                for (j = 0; j < S; j++)
                        printf("%03d ", bp->group[i * S + j]);
                printf(" ");
                for (j = 0; j < S; j++)
                        printf("%02d ", bp->libs[i * S + j]);
#if 0
                printf(" ");
                for (j = 0; j < S; j++)
                        printf("%1.02f ", bp->p[i * S + j]);
#endif
                printf("\n");
        }
        printf("random %d moves %d free_spots %d,%d to_play %d komi %d\n", 
bp->random, bp->moves, bp->free_spots[0], bp->free_spots[1], bp->to_play, 
bp->komi);
}

clock_t start_time;
void timestats(void) {
        struct tms t;
        clock_t now = times(&t);
        int u = sysconf(_SC_CLK_TCK);
        printf("TIMES: user %fs, system %fs, total %fs\n",
                        (float)t.tms_utime / u,
                        (float)t.tms_stime / u,
                        (float)(now - start_time) / u);
}

int
main(int argc, char *argv[])
{
        if (argc < 3) {
                fprintf(stderr, "Usage: %s RANDSEED PLAYOUTS PLAYOUTSPERJOB\n", 
argv[0]);
                exit(EXIT_FAILURE);
        }
        b.random = atoi(argv[1]);
        int iters = atoi(argv[2]), ppj = atoi(argv[3]);
        b.free_spots[0] = b.free_spots[1] = RS * RS;
        b.moves = b.komi = 0;
        b.to_play = S_BLACK;
        for (int i = 0; i < S2; i++) {
                b.stone[i] = (i % S == 0 || i / S == 0 || i / S >= S) ? S_EDGE 
: S_NONE;
                b.group[i] = 0;
                b.libs[i] = 0;
                if (b.stone[i] == S_NONE)
                        b.p[i] = 1.f / b.free_spots[b.to_play - 1];
                else
                        b.p[i] = 0;
        }
        //print_board(&b);

        start_time = times(NULL);

        int score[ppj];
        struct board *gb; int *gscore;
        cudaMalloc((void**) &gb, sizeof(*gb) * ppj);
        cudaMalloc((void**) &gscore, sizeof(*gscore) * ppj);

        int black = 0, white = 0;

        for (int i = 0; i < iters; i += ppj) {
                //printf("Copying boards to GPU...\n");
                for (int j = 0; j < ppj; j++) {
                        cudaMemcpy(&gb[j], &b, sizeof(b), 
cudaMemcpyHostToDevice);
                        b.random++;
                }

                int blocks = ppj;
                int threads = S2;
                //timestats();
                //printf("Crunching...\n");
                player <<< blocks, threads >>> (gb, gscore);
                //timestats();

                //printf("Copying score back...\n");
                cudaMemcpy(&score, gscore, sizeof(*gscore) * ppj, 
cudaMemcpyDeviceToHost);
                for (int j = 0; j < ppj; j++)
                        if (score[j] > 0)
                                white++;
                        else if (score[j] < 0)
                                black++;
#if 0
                for (int j = 0; j < ppj; j++)
                        printf("%d ", score[j]);
                printf("\n");
#endif
#if 0
                struct board b0;
                cudaMemcpy(&b0, &gb[0], sizeof(b0), cudaMemcpyDeviceToHost);
                print_board(&b0);
#endif
                //timestats();
        }

        cudaFree(gb);
        cudaFree(gscore);
        cudaThreadExit();
        timestats();
        printf("Win stats: %.4f%% for black (%d games)\n", 
(float)black/(black+white), black);
        return EXIT_SUCCESS;
}

_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] CUDA implementation of the per-intersection GPGPU approach

Reply via email to