Re: [computer-go] GPGPU

Vincent Diepeveen Mon, 14 Sep 2009 19:48:43 -0700

Hi Petr,

I missed this posting yours at feb 10.


On Feb 10, 2009, at 1:44 AM, Petr Baudis wrote:

  Hi!

  There has been some talk about implementing monte-carlo playouts on
GPUs in the past, I have heard rumours about Polish bachelor student
doing libego -> GPGPU conversion as a project, etc. but I know of
nothing concrete ever materializing - is anybody aware of anything?

Of course as we will all realize a lot of people have tried all kindof stuff on gpgpu

already and obviously at first at nvidia.

For chess at the moment it's more complicated than go as a sharedhashtable is real

important in chess. This will come in go also as being real important.

Whether the current generation nvidia or amd can reach the optimum ofthe claimedperformance is of course interesting question today. However it's notso relevantin longterm, as what is real relevant is the question: "what istheoretical possible here?"

We can simply assume that future generations improve. Be it nvidia,be it AMD, be it some other

manufacturer (intel?).

with some extra overhead it's quite possible to have a sharedhashtable. All those things

i solved here on paper.

The real interesting thing is of course things like branches. How toreplace them with

near to branchless code?

That's not so easy. That simply requires overhead. In case of patternmatching,it's obvious that in x86 you can skip a lot of patterns easily cheapby having a single

compare that fails. So that's a cycle or 30 you lose maximum.

Let's say probably a lot LESS cycles. 15 or so.

Not in gpgpu.

However as a compensation gpgpu's can push through more instructions.

Theoretical we can speak about a quadcore say i7 965:

3.2Ghz * 4 instructions a cycle * 4 cores = 51.2 G instructions persecond = 51.2GIPS

AMD : 320 cores @ 5 execution units = 1600 execution units * 0.97Ghz* 1 instruction a cycle = 0.97 * 1600 = 1552 GIPS

NVIDIA : 480 streamcores * 1.5Ghz * 1 instruction a cycle = 720 GIPS

note sometimes nvidia counts double a cycle as there is multiply addin floating point, so i don't want to say thatAMD is superior here to nvidia or vice versa, as it is a totaldifferent architecture to start with and obviously AMDis not capable of doing multiplication with all 5 units of each fullcore; so the gflops count is something total different

here than what we do namely game tree search.

For me a factor 2 here is really not interesting to discuss, so ireally don't want to get into a discussion which of the 2architectures is faster for game tree search at this moment. I mightwhen i get sponsored by one of the manufacturers though :)

The interesting thing is that to AMD the fastest intel quadcore chipis say factor 30 slower on paper if we look to the raw GFLOP.

For all kind of gflop researches we know that the x86's really getfar over 90% efficiency and the best claim everof serious software that runs in huge numbers at gpu's, they claim40% efficiency.

Scientists who claim therefore more than say factor 15 speeddifference of a gpu versus x86, they are not objective;

if not committing complete fraud on paper.

To my estimate it would be possible to do things at a factor 5 loss.So you're 6 times faster then than a quadcore.As i like numbers that are a multiple of 5 in this case, i round itdown to 5.

Whatever complex calculation i do taking into account all factors,after months of study over a period of more than 2.5 nearly 3 years now,

i each time come down to factor 5, already for a number of years.

Achieving this is not simple. Let's first look whether it is possibleto also get a memory bandwidth of factor 5.

If we look to memory bandwidth this factor 6 you can sustain realtimeand nonstop 24/24 hours a day,as the memory bandwidth claimed of the i7 is say 20-30GB/s and thetopend gpu's now definitely are over factor 5faster in bandwidth. AMD has DDR5 there (not sure nvidia already has,but if not then they'll follow soon), so to speak the gpu'shave 2 generations newer RAM than the quadcores. Just the differencein RAM already means factor 4 faster for the gpu's,multiply to that more memory controllers that they use than the cpu'sand fact that a lot gets done in a decentral manner at gpu's

whereas in cpu's there is a centralized L3 cache to the RAM.

So bandwidth wise seen it is quite possible to be practical 5 timesfaster or so up to 10 times if you're a bit more optimistic.

However to get it out of the gpu's is not so easy for game treesearch. Optimizing low level is already really complicated.Here you also need 2 layer parallel model at least and figure outexactly where the chip can get a high IPC (number of

instructions per cycle) and how to get a high IPC done.

We have had now say 30 years of optimizations for processors in asequential manner with branches. Programming codethat's with nearly no branches is a total different league. Using themany different instructions that implement conditional move

type instructions is really complicated.

In other sciences such as number theory where there is such softwarethat uses the utmost low level code vectorwise for SSE2+,MMX you name it, we typical see that out of the millions of mathenthusiasts there is worldwide, only litterary a small handful knowshow toreally code it that low level and it takes years and years ofoptimizatiosn to get it done.

In that sense it's a big learningproces for us all and only a fewwill be ABLE to succeed for now, until it is common knowledge how to

get things done.

The big question is of course after having done so much effort to runwell at a gpu: how many of your program can you sell?


That is a very nasty question isn't it?

Please note that larrabee from intel is a different type of chip,despite what intel wants you to believe.To achieve the same thing what you can easily do on a gpu, you canalso do it on larrabee, just it uses bigger vectors.

So to start with the programming model is more complicated. Eachvector is 512 bits.

However to see each element as a separated 'cpu' keep in mind youneed indirect memory accesses. Gather is one of the 2 you need.it has a huge latency. 7+ cycles and you need these instructionsnonstop.

So any claim that it could perform for stuff like game tree search inthe same manner is total nonsense. Any calculation of above thereforedoesn't apply to larrabee. See larrabee as a full vector processorand the gpgpu's as manycores where indirect memory accesses are a loteasierto do and faster than at larrabee and you'll need these nonstop ofcourse at larrabee; if you wouldn't do them then each vector of 512bits would needto be 1 single 'go board'. Though in theory possible, that would meanyou only have a core or 10 to your avail, versus 3...@1600 or 480 at ax2 gpu card.

I know a lot of quadcores that are going to easily beat 10 vectorcores for game tree search.

So porting game tree search to full vector processors is going to bea lot tougher than gpgpu is my estimation. It's not impossible ofcourse. Just a lot harder

and it's more puzzle work of course.

Vincent

  We have recently bought very high-end nVidia card at our university

department for trying out GPGPU and I'm thinking of a littleproject for

myself, maybe...

  I'm not much skilled in this kind of programming though, so I'm not

quite sure what the best design approach to take would be... mycurrent

ideas so far are either

  (i) Have one game per pipeline, and in each cycle, take a board and

transform it by playing a random move (or possibly have an extracleanup

cycles if capturing stones would take too long); you should be able to
do some neat pattern matching and so too

  (ii) Have one intersection per pipeline, in one cycle play a random
move, then have board_height cycles for captures and liberty updates
to ripple through to all the neighbors. The code would be much more
streamlined, but I'm not sure yet if there is a good way to do the
rippling - ORing bitmaps?

  I guess there is a lot of things to try out.

--
                                Petr "Pasky" Baudis
The average, healthy, well-adjusted adult gets up at seven-thirty
in the morning feeling just terrible. -- Jean Kerr
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/


_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/

Re: [computer-go] GPGPU

Reply via email to