Thanks for sharing this, Christian.
My comments are inline below.

On Sep 9, 2009, at 5:54 PM, Christian Nentwich wrote:


I did quite a bit of testing earlier this year on running playout algorithms on GPUs. Unfortunately, I am too busy to write up a tech report on it, but I finally brought myself to take the time to write this e-mail at least. See bottom for conclusions.

For performance testing, I used my CPU board representation, and a CUDA port of the same (with adjustments), to test the following algorithm:
 - Clear the board
 - Fill the board according to a uniform random policy
 - Avoid filling eyes, according to simple neighbour check
 - Avoid simple ko
 - Count the score and determine the winner
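
To make the steps above concrete, here is a deliberately stripped-down sketch of that control flow in plain C (it also compiles as CUDA host code). It omits captures and ko entirely, so it is not a correct Go playout, only the fill-randomly / skip-eyes / count-the-result skeleton; all names, the byte-per-point board and the xorshift RNG are my own choices, not Christian's code.

#include <stdio.h>

#define N       19
#define EMPTY   0
#define BLACK   1
#define WHITE   2

/* xorshift32: the kind of tiny per-thread RNG typically used on a GPU */
static unsigned int rng_next(unsigned int *s)
{
    unsigned int x = *s;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return *s = x;
}

static int on_board(int r, int c) { return r >= 0 && r < N && c >= 0 && c < N; }

/* "simple neighbour check": every on-board neighbour is our own colour */
static int is_eye_like(const unsigned char *b, int r, int c, int colour)
{
    const int dr[4] = { -1, 1, 0, 0 }, dc[4] = { 0, 0, -1, 1 };
    for (int i = 0; i < 4; ++i) {
        int rr = r + dr[i], cc = c + dc[i];
        if (on_board(rr, cc) && b[rr * N + cc] != colour)
            return 0;
    }
    return 1;
}

/* Fill the empty board with uniformly random moves, skipping eye-like points,
   then return +1 if black placed more stones, -1 if white did, 0 on a tie.
   Captures and ko are NOT modelled. */
static int playout(unsigned int *seed)
{
    unsigned char board[N * N] = { 0 };            /* clear the board */
    int colour = BLACK, passes = 0;

    while (passes < 2) {
        int cand[N * N], ncand = 0;                /* candidate moves */
        for (int p = 0; p < N * N; ++p)
            if (board[p] == EMPTY && !is_eye_like(board, p / N, p % N, colour))
                cand[ncand++] = p;

        if (ncand == 0) {
            passes++;                              /* no move left: pass */
        } else {
            passes = 0;
            board[cand[rng_next(seed) % ncand]] = colour;
        }
        colour = (colour == BLACK) ? WHITE : BLACK;
    }

    int score = 0;
    for (int p = 0; p < N * N; ++p)
        score += (board[p] == BLACK) - (board[p] == WHITE);
    return (score > 0) - (score < 0);
}

int main(void)
{
    unsigned int seed = 12345u;
    int black_wins = 0, games = 10000;
    for (int g = 0; g < games; ++g)
        black_wins += (playout(&seed) > 0);
    printf("black wins %d of %d simplified playouts\n", black_wins, games);
    return 0;
}

On the GPU the same loop would run once per thread, with the board held in shared memory instead of on the stack.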

In other words: no tree search is involved, and this is the lightest possible playout. The raw numbers are as follows:
 - CPU Search: 47,000 playouts per CPU core per second, on an Intel 6600 Core-2 Duo
 - GPU Search: 170,000 playouts per second, on an NVidia Geforce 285 card


It would be interesting to know how many nodes per second this amounts to in terms of hits on the local shared memory, and what IPC you achieved overall. That 9% figure might not be very accurate; read further for why.

Nvidia is currently the weakest of the GPU manufacturers if you run on the consumer GPUs rather than the Tesla cards. Nvidia is of course the best known, as they were the first to bring GPGPU to the masses and to promote it. Note that in the 90s I already came across chess programs implemented on graphics cards by hardware designers who did this as a hobby; at the time those cards were in many cases clocked higher than CPUs.

Nvidia does not really reveal which instructions the GPUs actually have, and that is probably a major bottleneck for your program. To start with, it is quite possible that Nvidia's entire 'gflops' figure is based on combined instructions such as multiply-add.

So a stream core can in principle execute at most 1 instruction per cycle, yet the numbers Nvidia throws around count such operations as 2 instructions (so double the flops), whereas in reality it is a single instruction on the GPU.

If your program does not use multiply-adds, that already cuts down the theoretical gain you can achieve on Nvidia hardware.
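
To put a rough number on that (my own back-of-the-envelope figure, not from Christian's data): for a Geforce GTX 285 the advertised peak comes from something like

  240 stream cores * 1.476 GHz * 3 flops per cycle (dual-issued MAD + MUL) = roughly 1063 gflops,

so code that issues no multiply-adds can at best reach about a third of that headline number.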

Other manufacturers seem more open here.

Note that approaching Nvidia privately did not help either. This was for a potential pilot project a while ago for simulation software (simulations of anything from generic systems up to fighter jet software). Obviously a pilot project is just that, a pilot, and not some massive deal. Nvidia indicated it would *perhaps* reveal something only in the case of a paper deal for a really large number of 'Tesla' cards.

As we know those cards are a bit pricey (not for the simulator clients, but still), so a deal for hundreds or thousands of pre-ordered cards was of course never going to work out.
A pilot project exists precisely to convince people that something is possible.

We know that the memory bandwidth of the consumer GPUs for GPGPU work is a lot worse than that of Nvidia's Tesla versions, and that is the problem running throughout your story. However, I see solutions there.

Also, your result is not bad at all. Most likely you wrote quite efficient CUDA code. What limited your nps, so to speak, is probably that your program does very little work per node.

If you add more knowledge, it is quite possible that you hardly slow down at all, since memory latency is what you keep hitting anyway.
So maybe you can make the program more knowledgeable essentially for free.

Also, what I am missing from your side is some technical data on how many bytes each stream core looks up in shared RAM per node. For example (I might be confusing this with device RAM), I thought the cache line size is 256 bytes per read for each core.

So if you use considerably less than that, it is of course seriously slower.
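
For what it is worth, my understanding of the GT200 generation (so treat this as an assumption on my part, not a correction): global memory is not cached at all on these cards, and reads are issued per half-warp of 16 threads as 32-, 64- or 128-byte transactions. So 16 threads reading consecutive 4-byte words cost one transaction,

  16 threads * 4 bytes = 64 bytes in a single read,

while the same 16 words scattered across memory cost 16 separate transactions.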

The algorithm running on the GPU is a straight port, with several optimisations then made to severely restrict memory access. This means the algorithm is a "naive" sort of parallel algorithm, parallel on a per-board level like the CPU implementation, rather than per-intersection or some other sort of highly parallel algorithm.

Memory access other than shared processor memory carries a severe penalty on the GPU. Instead, all threads running on the GPU at any one time have to make do with a fast shared memory of 16384 bytes. So:
 - The board was compressed into a bit board, using 2*21 unsigned ints per thread

2 * 21 * 32 = 1344 bits

Maybe the GPU is simply very slow at bit manipulation;
these chips were not designed for cryptography-style bit twiddling.
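
Just to make the packing concrete, here is one plausible layout consistent with "2*21 unsigned ints per thread": 21 rows of 21 points (19x19 plus a one-point border) at 2 bits per point, two unsigned ints per row. The layout, the names and the eye heuristic below are my guesses, not Christian's actual code; the caller is assumed to have initialised the border points to BORDER.

// Hypothetical 2-bits-per-point board packed into 2*21 unsigned ints
// per thread (for example in shared memory).
#define EMPTY   0u
#define BLACK   1u
#define WHITE   2u
#define BORDER  3u

// board points to this thread's 42 unsigned ints.
__host__ __device__ unsigned int get_point(const unsigned int *board,
                                           int row, int col)
{
    int bit = col * 2;                     // 2 bits per point, 42 bits per row
    const unsigned int *r = board + row * 2;
    if (bit < 32)
        return (r[0] >> bit) & 3u;
    return (r[1] >> (bit - 32)) & 3u;
}

__host__ __device__ void set_point(unsigned int *board,
                                   int row, int col, unsigned int colour)
{
    int bit = col * 2;
    unsigned int *r = board + row * 2;
    if (bit < 32) {
        r[0] = (r[0] & ~(3u << bit)) | (colour << bit);
    } else {
        bit -= 32;
        r[1] = (r[1] & ~(3u << bit)) | (colour << bit);
    }
}

// "Simple neighbour check" eye heuristic: a point is eye-like for 'colour'
// if all four neighbours are that colour or the border.
__host__ __device__ bool is_eye_like(const unsigned int *board,
                                     int row, int col, unsigned int colour)
{
    const int dr[4] = { -1, 1, 0, 0 };
    const int dc[4] = { 0, 0, -1, 1 };
    for (int i = 0; i < 4; ++i) {
        unsigned int n = get_point(board, row + dr[i], col + dc[i]);
        if (n != colour && n != BORDER)
            return false;
    }
    return true;
}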

- The count of empty, white and black intersections and the ko position was also in shared memory per thread

Wouldn't it speed things up to do things cheap and dirty near the edges, allowing some illegal results, and just gamble that those illegal results don't propagate back to the root?
(Especially since the playing strength of these playouts is still quite weak.)

- String/group/block type information was in global memory, as there was no way to store it in 16384 bytes


Maybe this also gave a severe penalty. In a few days I will bug some people from the RAM companies about this subject.


Optimal speed was at 80 threads per block, with 256 blocks. The card had only 9% processor occupancy, due to the shared memory being almost exhausted. However, branch divergence was at only 2%, which is not bad at all - suggesting that the form of parallelism may not be a blocker. This may be because the "usual" case of a point either being illegal to play, or a simple play without a need to merge or remove strings is by far the most common case.
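
If I calculate along with these numbers (my own back-of-the-envelope, so an estimate rather than data from the post): 80 threads * 2*21 unsigned ints * 4 bytes = 13,440 bytes of board data per block, plus the per-thread counters and ko position, which indeed nearly exhausts the 16,384 bytes of shared memory, so only one block fits on a multiprocessor at a time. One block of 80 threads is 3 warps, and a GT200 multiprocessor can hold 32, so 3/32 = roughly 9%, which matches the reported occupancy.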

Conclusions:

I see these results as broadly negative with the current generation of technology. Per-board parallelism on a GPU is not worth it compared to the CPU speed and the severe drawbacks of working on a GPU (testing is hard, unfamiliar environment for most programmers, lots of time to spend on optimisation, etc).


I see your result as very positive. But indeed, as you say, a big problem, which also showed up in my own calculations two years ago, is the amount of cache per core.

Did you consider the AMD 770 card?
It has 4x more cache per core, and each core consists of 5 execution units.

The problems would be severely compounded by trying to integrate any tree search, or heavy playouts. Trees are almost impossible to construct on a GPU because pointers cannot be transferred from the host to the GPU. They could still be represented using arrays, but the random nature of tree access would cause huge penalties as it would prevent coalesced memory access.
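
The usual workaround for the pointer problem is exactly the array representation mentioned above: store links as integer indices into one flat node array, so the identical bytes remain valid after a host-to-device copy. A minimal sketch under that assumption (the struct fields and names are mine, not anything from Christian's code):

#include <cuda_runtime.h>
#include <stdlib.h>

// Pointer-free tree node: children are indices into the same flat array,
// so the whole tree can be copied to the device with one cudaMemcpy.
struct Node {
    int   move;         // move that led to this node
    int   parent;       // index of parent, -1 for the root
    int   first_child;  // index of first child, -1 if none
    int   num_children; // children occupy a contiguous index range
    int   visits;
    float wins;
};

int main(void)
{
    const int capacity = 1 << 20;
    Node *h_tree = (Node *)malloc(capacity * sizeof(Node));

    // ... tree is built on the host using indices instead of pointers ...
    h_tree[0].move = -1;  h_tree[0].parent = -1;
    h_tree[0].first_child = -1;  h_tree[0].num_children = 0;
    h_tree[0].visits = 0;  h_tree[0].wins = 0.0f;
    int used = 1;

    // The identical bytes are valid on the GPU, since no pointers are stored.
    Node *d_tree = NULL;
    cudaMalloc((void **)&d_tree, capacity * sizeof(Node));
    cudaMemcpy(d_tree, h_tree, used * sizeof(Node), cudaMemcpyHostToDevice);

    cudaFree(d_tree);
    free(h_tree);
    return 0;
}

This only solves the transfer problem; as Christian says, random access into such an array from many threads would still defeat coalescing.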


The GPU requires a totally different approach. The big advantage you have with it is that you can push through a huge number of instructions. The real problem is that it is so fast that you run into serious trouble the moment you access device RAM.

If some dozens of stream cores read from the same RAM, that is a big problem.

Highly parallel algorithms (e.g. one thread per intersection) can still be investigated, but my (unproven!) intuition is that it is not worth it, as most intersections will be idle on any given move, wasting processor occupancy time.

Nah, I find the result very encouraging in fact, since you started with basically zero information on how to avoid the RAM and no clue about which instructions the GPU actually implements.

The trick, I feel, is rather to focus on *what the GPU is good at*.

You tried a CPU-style approach here and you are faster than a dual core, by a lot.
That is much better than what others who tried achieved.

It needs step-by-step improvement and a more thorough understanding of how the hardware works and what it is capable of, in order to know where to improve.

For example, with CPUs we know that the L1 cache has a throughput of 1 (Intel) or 2 (AMD) loads per cycle, and a latency that is quite bad. We can easily measure whether we stay within L1, or by how much we fall outside it, yet only a few people do so.

If you were to post on this list, "how many L1i and L1d misses do you have?", I bet nearly nobody could answer it.

In the case of my chess program it is 1.34% misses in L1i (Core 2) and 0.6% in L1d (just 0.1% gets found in L2; the rest is hash table accesses going to RAM).
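
In case anyone wants to measure exactly that for their own engine, hardware performance counters make it fairly painless. Below is a minimal sketch using the PAPI low-level API; PAPI_L1_ICM and PAPI_L1_DCM are standard preset events, but whether a given CPU exposes them varies, so the error checks matter, and the workload() function is just a placeholder for your own code. Link with -lpapi.

#include <stdio.h>
#include <papi.h>

/* stand-in for "run your engine's playouts here" */
static void workload(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i)
        x += (double)i;
}

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2] = { 0, 0 };

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    if (PAPI_create_eventset(&evset) != PAPI_OK) return 1;
    /* preset events; check the return codes instead of assuming the
       CPU supports them */
    if (PAPI_add_event(evset, PAPI_L1_ICM) != PAPI_OK) return 1;
    if (PAPI_add_event(evset, PAPI_L1_DCM) != PAPI_OK) return 1;

    PAPI_start(evset);
    workload();
    PAPI_stop(evset, counts);

    printf("L1i misses: %lld   L1d misses: %lld\n", counts[0], counts[1]);
    return 0;
}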

Among the people who know such technical details, you would need someone clever enough to come up with a plan and a use for what the GPUs CAN deliver,
namely more instructions per cycle than quad cores can.

So all kinds of things that run mostly branchless become very cheap.

For example, software with big evaluation functions (or with knowledge to select moves) would be able to improve the quality of the program while getting nearly the same nps as you got,
whereas on quad cores it would slow down a lot.


My feeling is that GPUs may have some potential in this area, but possibly in a supplementary role such as running additional pattern matching in the background, or driving machine learning components.


Some Chinese researchers reported a while ago that they achieved an efficiency of 20-25% on 8800-based Nvidia GPUs and 40-50% on ATI/AMD cards for their embarrassingly parallel software. It was not 100% clear what type of software, though, as they may not have been keen to reveal exactly what and how.

Now with an attempt here that hammers head-on into the latency of the caches and the RAM, you already get 9% on a first attempt; that is seriously good.

Vincent


This e-mail is a bit hurried, so.. questions are welcome!!

Christian

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
