Hi Petr,
I missed this posting yours at feb 10.
On Feb 10, 2009, at 1:44 AM, Petr Baudis wrote:
Hi!
There has been some talk about implementing monte-carlo playouts on
GPUs in the past, I have heard rumours about Polish bachelor student
doing libego -> GPGPU conversion as a project, etc. but I know of
nothing concrete ever materializing - is anybody aware of anything?
Of course as we will all realize a lot of people have tried all kind
of stuff on gpgpu
already and obviously at first at nvidia.
For chess at the moment it's more complicated than go as a shared
hashtable is real
important in chess. This will come in go also as being real important.
Whether the current generation nvidia or amd can reach the optimum of
the claimed
performance is of course interesting question today. However it's not
so relevant
in longterm, as what is real relevant is the question: "what is
theoretical possible here?"
We can simply assume that future generations improve. Be it nvidia,
be it AMD, be it some other
manufacturer (intel?).
with some extra overhead it's quite possible to have a shared
hashtable. All those things
i solved here on paper.
The real interesting thing is of course things like branches. How to
replace them with
near to branchless code?
That's not so easy. That simply requires overhead. In case of pattern
matching,
it's obvious that in x86 you can skip a lot of patterns easily cheap
by having a single
compare that fails. So that's a cycle or 30 you lose maximum.
Let's say probably a lot LESS cycles. 15 or so.
Not in gpgpu.
However as a compensation gpgpu's can push through more instructions.
Theoretical we can speak about a quadcore say i7 965:
3.2Ghz * 4 instructions a cycle * 4 cores = 51.2 G instructions per
second = 51.2GIPS
AMD : 320 cores @ 5 execution units = 1600 execution units * 0.97Ghz
* 1 instruction a cycle = 0.97 * 1600 = 1552 GIPS
NVIDIA : 480 streamcores * 1.5Ghz * 1 instruction a cycle = 720 GIPS
note sometimes nvidia counts double a cycle as there is multiply add
in floating point, so i don't want to say that
AMD is superior here to nvidia or vice versa, as it is a total
different architecture to start with and obviously AMD
is not capable of doing multiplication with all 5 units of each full
core; so the gflops count is something total different
here than what we do namely game tree search.
For me a factor 2 here is really not interesting to discuss, so i
really don't want to get into a discussion which of the 2
architectures is faster for game tree search at this moment. I might
when i get sponsored by one of the manufacturers though :)
The interesting thing is that to AMD the fastest intel quadcore chip
is say factor 30 slower on paper if we look to the raw GFLOP.
For all kind of gflop researches we know that the x86's really get
far over 90% efficiency and the best claim ever
of serious software that runs in huge numbers at gpu's, they claim
40% efficiency.
Scientists who claim therefore more than say factor 15 speed
difference of a gpu versus x86, they are not objective;
if not committing complete fraud on paper.
To my estimate it would be possible to do things at a factor 5 loss.
So you're 6 times faster then than a quadcore.
As i like numbers that are a multiple of 5 in this case, i round it
down to 5.
Whatever complex calculation i do taking into account all factors,
after months of study over a period of more than 2.5 nearly 3 years now,
i each time come down to factor 5, already for a number of years.
Achieving this is not simple. Let's first look whether it is possible
to also get a memory bandwidth of factor 5.
If we look to memory bandwidth this factor 6 you can sustain realtime
and nonstop 24/24 hours a day,
as the memory bandwidth claimed of the i7 is say 20-30GB/s and the
topend gpu's now definitely are over factor 5
faster in bandwidth. AMD has DDR5 there (not sure nvidia already has,
but if not then they'll follow soon), so to speak the gpu's
have 2 generations newer RAM than the quadcores. Just the difference
in RAM already means factor 4 faster for the gpu's,
multiply to that more memory controllers that they use than the cpu's
and fact that a lot gets done in a decentral manner at gpu's
whereas in cpu's there is a centralized L3 cache to the RAM.
So bandwidth wise seen it is quite possible to be practical 5 times
faster or so up to 10 times if you're a bit more optimistic.
However to get it out of the gpu's is not so easy for game tree
search. Optimizing low level is already really complicated.
Here you also need 2 layer parallel model at least and figure out
exactly where the chip can get a high IPC (number of
instructions per cycle) and how to get a high IPC done.
We have had now say 30 years of optimizations for processors in a
sequential manner with branches. Programming code
that's with nearly no branches is a total different league. Using the
many different instructions that implement conditional move
type instructions is really complicated.
In other sciences such as number theory where there is such software
that uses the utmost low level code vectorwise for SSE2+,
MMX you name it, we typical see that out of the millions of math
enthusiasts there is worldwide, only litterary a small handful knows
how to
really code it that low level and it takes years and years of
optimizatiosn to get it done.
In that sense it's a big learningproces for us all and only a few
will be ABLE to succeed for now, until it is common knowledge how to
get things done.
The big question is of course after having done so much effort to run
well at a gpu: how many of your program can you sell?
That is a very nasty question isn't it?
Please note that larrabee from intel is a different type of chip,
despite what intel wants you to believe.
To achieve the same thing what you can easily do on a gpu, you can
also do it on larrabee, just it uses bigger vectors.
So to start with the programming model is more complicated. Each
vector is 512 bits.
However to see each element as a separated 'cpu' keep in mind you
need indirect memory accesses. Gather is one of the 2 you need.
it has a huge latency. 7+ cycles and you need these instructions
nonstop.
So any claim that it could perform for stuff like game tree search in
the same manner is total nonsense. Any calculation of above therefore
doesn't apply to larrabee. See larrabee as a full vector processor
and the gpgpu's as manycores where indirect memory accesses are a lot
easier
to do and faster than at larrabee and you'll need these nonstop of
course at larrabee; if you wouldn't do them then each vector of 512
bits would need
to be 1 single 'go board'. Though in theory possible, that would mean
you only have a core or 10 to your avail, versus 3...@1600 or 480 at a
x2 gpu card.
I know a lot of quadcores that are going to easily beat 10 vector
cores for game tree search.
So porting game tree search to full vector processors is going to be
a lot tougher than gpgpu is my estimation. It's not impossible of
course. Just a lot harder
and it's more puzzle work of course.
Vincent
We have recently bought very high-end nVidia card at our university
department for trying out GPGPU and I'm thinking of a little
project for
myself, maybe...
I'm not much skilled in this kind of programming though, so I'm not
quite sure what the best design approach to take would be... my
current
ideas so far are either
(i) Have one game per pipeline, and in each cycle, take a board and
transform it by playing a random move (or possibly have an extra
cleanup
cycles if capturing stones would take too long); you should be able to
do some neat pattern matching and so too
(ii) Have one intersection per pipeline, in one cycle play a random
move, then have board_height cycles for captures and liberty updates
to ripple through to all the neighbors. The code would be much more
streamlined, but I'm not sure yet if there is a good way to do the
rippling - ORing bitmaps?
I guess there is a lot of things to try out.
--
Petr "Pasky" Baudis
The average, healthy, well-adjusted adult gets up at seven-thirty
in the morning feeling just terrible. -- Jean Kerr
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/