Hi Petr,

I missed this posting yours at feb 10.

On Feb 10, 2009, at 1:44 AM, Petr Baudis wrote:

  Hi!

  There has been some talk about implementing monte-carlo playouts on
GPUs in the past, I have heard rumours about Polish bachelor student
doing libego -> GPGPU conversion as a project, etc. but I know of
nothing concrete ever materializing - is anybody aware of anything?


Of course as we will all realize a lot of people have tried all kind of stuff on gpgpu
already and obviously at first at nvidia.

For chess at the moment it's more complicated than go as a shared hashtable is real
important in chess. This will come in go also as being real important.

Whether the current generation nvidia or amd can reach the optimum of the claimed performance is of course interesting question today. However it's not so relevant in longterm, as what is real relevant is the question: "what is theoretical possible here?"

We can simply assume that future generations improve. Be it nvidia, be it AMD, be it some other
manufacturer (intel?).

with some extra overhead it's quite possible to have a shared hashtable. All those things
i solved here on paper.

The real interesting thing is of course things like branches. How to replace them with
near to branchless code?

That's not so easy. That simply requires overhead. In case of pattern matching, it's obvious that in x86 you can skip a lot of patterns easily cheap by having a single
compare that fails. So that's a cycle or 30 you lose maximum.

Let's say probably a lot LESS cycles. 15 or so.

Not in gpgpu.

However as a compensation gpgpu's can push through more instructions.

Theoretical we can speak about a quadcore say i7 965:

3.2Ghz * 4 instructions a cycle * 4 cores = 51.2 G instructions per second = 51.2GIPS

AMD : 320 cores @ 5 execution units = 1600 execution units * 0.97Ghz * 1 instruction a cycle = 0.97 * 1600 = 1552 GIPS
NVIDIA : 480 streamcores * 1.5Ghz * 1 instruction a cycle = 720 GIPS
note sometimes nvidia counts double a cycle as there is multiply add in floating point, so i don't want to say that AMD is superior here to nvidia or vice versa, as it is a total different architecture to start with and obviously AMD is not capable of doing multiplication with all 5 units of each full core; so the gflops count is something total different
here than what we do namely game tree search.

For me a factor 2 here is really not interesting to discuss, so i really don't want to get into a discussion which of the 2 architectures is faster for game tree search at this moment. I might when i get sponsored by one of the manufacturers though :)

The interesting thing is that to AMD the fastest intel quadcore chip is say factor 30 slower on paper if we look to the raw GFLOP.

For all kind of gflop researches we know that the x86's really get far over 90% efficiency and the best claim ever of serious software that runs in huge numbers at gpu's, they claim 40% efficiency.

Scientists who claim therefore more than say factor 15 speed difference of a gpu versus x86, they are not objective;
if not committing complete fraud on paper.

To my estimate it would be possible to do things at a factor 5 loss. So you're 6 times faster then than a quadcore. As i like numbers that are a multiple of 5 in this case, i round it down to 5.

Whatever complex calculation i do taking into account all factors, after months of study over a period of more than 2.5 nearly 3 years now,
i each time come down to factor 5, already for a number of years.

Achieving this is not simple. Let's first look whether it is possible to also get a memory bandwidth of factor 5.

If we look to memory bandwidth this factor 6 you can sustain realtime and nonstop 24/24 hours a day, as the memory bandwidth claimed of the i7 is say 20-30GB/s and the topend gpu's now definitely are over factor 5 faster in bandwidth. AMD has DDR5 there (not sure nvidia already has, but if not then they'll follow soon), so to speak the gpu's have 2 generations newer RAM than the quadcores. Just the difference in RAM already means factor 4 faster for the gpu's, multiply to that more memory controllers that they use than the cpu's and fact that a lot gets done in a decentral manner at gpu's
whereas in cpu's there is a centralized L3 cache to the RAM.

So bandwidth wise seen it is quite possible to be practical 5 times faster or so up to 10 times if you're a bit more optimistic.

However to get it out of the gpu's is not so easy for game tree search. Optimizing low level is already really complicated. Here you also need 2 layer parallel model at least and figure out exactly where the chip can get a high IPC (number of
instructions per cycle) and how to get a high IPC done.

We have had now say 30 years of optimizations for processors in a sequential manner with branches. Programming code that's with nearly no branches is a total different league. Using the many different instructions that implement conditional move
type instructions is really complicated.

In other sciences such as number theory where there is such software that uses the utmost low level code vectorwise for SSE2+, MMX you name it, we typical see that out of the millions of math enthusiasts there is worldwide, only litterary a small handful knows how to really code it that low level and it takes years and years of optimizatiosn to get it done.

In that sense it's a big learningproces for us all and only a few will be ABLE to succeed for now, until it is common knowledge how to
get things done.

The big question is of course after having done so much effort to run well at a gpu: how many of your program can you sell?

That is a very nasty question isn't it?

Please note that larrabee from intel is a different type of chip, despite what intel wants you to believe. To achieve the same thing what you can easily do on a gpu, you can also do it on larrabee, just it uses bigger vectors.

So to start with the programming model is more complicated. Each vector is 512 bits.

However to see each element as a separated 'cpu' keep in mind you need indirect memory accesses. Gather is one of the 2 you need. it has a huge latency. 7+ cycles and you need these instructions nonstop.

So any claim that it could perform for stuff like game tree search in the same manner is total nonsense. Any calculation of above therefore doesn't apply to larrabee. See larrabee as a full vector processor and the gpgpu's as manycores where indirect memory accesses are a lot easier to do and faster than at larrabee and you'll need these nonstop of course at larrabee; if you wouldn't do them then each vector of 512 bits would need to be 1 single 'go board'. Though in theory possible, that would mean you only have a core or 10 to your avail, versus 3...@1600 or 480 at a x2 gpu card.

I know a lot of quadcores that are going to easily beat 10 vector cores for game tree search.

So porting game tree search to full vector processors is going to be a lot tougher than gpgpu is my estimation. It's not impossible of course. Just a lot harder
and it's more puzzle work of course.

Vincent

  We have recently bought very high-end nVidia card at our university
department for trying out GPGPU and I'm thinking of a little project for
myself, maybe...

  I'm not much skilled in this kind of programming though, so I'm not
quite sure what the best design approach to take would be... my current
ideas so far are either

  (i) Have one game per pipeline, and in each cycle, take a board and
transform it by playing a random move (or possibly have an extra cleanup
cycles if capturing stones would take too long); you should be able to
do some neat pattern matching and so too

  (ii) Have one intersection per pipeline, in one cycle play a random
move, then have board_height cycles for captures and liberty updates
to ripple through to all the neighbors. The code would be much more
streamlined, but I'm not sure yet if there is a good way to do the
rippling - ORing bitmaps?

  I guess there is a lot of things to try out.

--
                                Petr "Pasky" Baudis
The average, healthy, well-adjusted adult gets up at seven-thirty
in the morning feeling just terrible. -- Jean Kerr
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/

_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/

Reply via email to