Re: [Beowulf] gpgpu

Mikhail Kuzminsky Thu, 28 Aug 2008 11:03:39 -0700

In message from "Li, Bo" <[EMAIL PROTECTED]> (Thu, 28 Aug 2008 14:20:15 +0800):

...
Currently, the DP performance of GPUs is not as good as we expected: only 1/8 to 1/10 of the SP FLOPS. That is also a problem.

AMD data: Firestream 9170 SP performance is 5 GFLOPS/W vs 1 GFLOPS/W for DP, so DP is 5 times slower than SP.

Firestream 9250 has 1 TFLOPS for SP, so 1/5 of that is about 200 GFLOPS DP. The price will be, I suppose, about $2000, as for the 9170.

Let me look at a modern dual-socket quad-core Beowulf node w/price about $4000+, for example. For the Opteron 2350/2 GHz chips (which I use) the peak DP performance is 64 GFLOPS (8 cores). For 3 GHz Xeon chips it is about 100 GFLOPS.

Therefore GPGPU peak DP performance is 1.5-2 times higher than w/CPUs.
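
The peak figures quoted above can be reproduced with a few lines of Python. This is only a sketch: the value of 4 DP flops per cycle per core is an assumption (it is not stated in the message, though it reproduces the 64 GFLOPS figure), and the function names are made up for illustration.

```python
# Peak double-precision throughput: cores x clock (GHz) x DP flops per cycle.
# ASSUMPTION (not stated in the message): 4 DP flops per cycle per core.
DP_FLOPS_PER_CYCLE = 4


def cpu_peak_gflops(cores, clock_ghz, flops_per_cycle=DP_FLOPS_PER_CYCLE):
    """Theoretical peak of a node in GFLOPS."""
    return cores * clock_ghz * flops_per_cycle


def gpu_dp_gflops(sp_gflops, sp_to_dp_ratio):
    """Estimate the DP peak from the SP peak and the SP:DP slowdown factor."""
    return sp_gflops / sp_to_dp_ratio


opteron_node = cpu_peak_gflops(cores=8, clock_ghz=2.0)  # dual-socket Opteron 2350
xeon_node = cpu_peak_gflops(cores=8, clock_ghz=3.0)     # dual-socket 3 GHz Xeon
firestream_9250 = gpu_dp_gflops(sp_gflops=1000.0, sp_to_dp_ratio=5.0)

print(f"Opteron node:    {opteron_node:.0f} GFLOPS")
print(f"Xeon node:       {xeon_node:.0f} GFLOPS")
print(f"Firestream 9250: {firestream_9250:.0f} GFLOPS (estimated DP)")
print(f"GPU/Opteron ratio: {firestream_9250 / opteron_node:.2f}")
print(f"GPU/Xeon ratio:    {firestream_9250 / xeon_node:.2f}")
```

On these raw peak numbers alone the GPU-to-node ratio comes out between roughly 2 and 3, so the 1.5-2 estimate is on the cautious side.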

Is such a peak advantage enough for a substantial speedup of the calculation, taking into account the time for data transmission to/from the GPU?
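
One rough way to approach that question is a simple time model: GPU time is compute time plus transfer time, CPU time is compute time only. The sketch below uses the peak figures from this message; the 4 GB/s link bandwidth is an assumed illustrative value, not a measurement, and all function names are hypothetical.

```python
# Rough model: does offloading pay off once host<->GPU transfers are counted?
# The link bandwidth below is an illustrative ASSUMPTION, not a measurement.

def offload_speedup(flops, bytes_moved, cpu_gflops, gpu_gflops, link_gb_per_s):
    """Speedup of GPU over CPU when the GPU time includes data transfer."""
    t_cpu = flops / (cpu_gflops * 1e9)
    t_gpu = flops / (gpu_gflops * 1e9) + bytes_moved / (link_gb_per_s * 1e9)
    return t_cpu / t_gpu


def dgemm_cost(n):
    """Flops and bytes for C = A*B with n x n double-precision matrices."""
    flops = 2 * n**3             # n^3 multiply-add pairs
    bytes_moved = 3 * n * n * 8  # send A and B, receive C
    return flops, bytes_moved


def daxpy_cost(n):
    """Flops and bytes for y = a*x + y with vectors of length n."""
    flops = 2 * n
    bytes_moved = 3 * n * 8      # send x and y, receive y
    return flops, bytes_moved


CPU_GFLOPS = 64.0   # Opteron node peak quoted in the message
GPU_GFLOPS = 200.0  # estimated Firestream 9250 DP peak from the message
LINK_GB_S = 4.0     # ASSUMED effective host<->GPU bandwidth

for name, cost in (("dgemm, n=4096", dgemm_cost(4096)),
                   ("daxpy, n=10**7", daxpy_cost(10**7))):
    s = offload_speedup(*cost, CPU_GFLOPS, GPU_GFLOPS, LINK_GB_S)
    print(f"{name}: speedup {s:.3f}")
```

In this model a kernel with O(n^3) work on O(n^2) data keeps most of the peak advantage, while a kernel with O(n) work on O(n) data is dominated by the transfer and ends up slower than the CPU. Real codes sit somewhere in between, and peak rates are optimistic for both sides.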

I would suggest hybrid computation platforms, with GPUs, CPUs, and processors like Clearspeed. It may be a good topic for programming models.

Clearspeed, unless there is new hardware now, does not have enough DP performance in comparison w/typical modern servers on quad-core CPUs.


Yours

Mikhail

Regards,
Li, Bo
----- Original Message -----
From: "Vincent Diepeveen" <[EMAIL PROTECTED]>
To: "Li, Bo" <[EMAIL PROTECTED]>
Cc: "Mikhail Kuzminsky" <[EMAIL PROTECTED]>; "Beowulf" <beowulf@beowulf.org>
Sent: Thursday, August 28, 2008 12:22 AM
Subject: Re: [Beowulf] gpgpu
Hi Bo,

Thanks for your message.

What library do I call to find primes?
Currently it's searching here for primes (PRPs) of the form
p = (2^n + 1) / 3

n is currently around 1.5 million, so the candidates are roughly 1.5 million bits long.
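
For readers unfamiliar with this kind of search, here is a minimal sketch of a Fermat probable-prime test for p = (2^n + 1) / 3, written with an explicit square-and-multiply loop so the repeated modular squaring is visible. Base 3 is an assumption made for this illustration, and the function names are hypothetical. The optimized assembler code mentioned in this message does the squaring with far faster techniques; Python's big integers only make sense here for tiny n.

```python
# Sketch of a Fermat PRP test for p = (2^n + 1) / 3.
# Base 3 is an ASSUMPTION made for illustration.

def modpow_by_squaring(base, exponent, modulus):
    """Left-to-right binary exponentiation: one modular squaring per bit."""
    result = 1
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus  # the squaring that dominates the cost
        if bit == "1":
            result = (result * base) % modulus
    return result


def is_prp(n, base=3):
    """True if (2^n + 1) / 3 passes the Fermat test to the given base."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that 3 divides 2^n + 1")
    p = (2**n + 1) // 3
    return modpow_by_squaring(base, p - 1, p) == 1


# Odd exponents below 130 whose candidate passes the test.
# n = 3 is skipped because p = 3 is not coprime to the base.
print([n for n in range(5, 130, 2) if is_prp(n)])
```

A candidate with about 1.5 million bits needs about 1.5 million such squarings of a 1.5-million-bit number, which is why the speed of one squaring decides everything.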
For SSE2-type processors there is George Woltman's assembler code
(MiT) to do the squaring + implicit modulo;
how do you plan to beat that kind of really optimized number crunching
on a GPU?
You'll have to figure out a way to find instruction-level parallelism of at least 32, which also doesn't write to the same cache line, I *guess* (there is in fact no documentation to verify that).
So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes
In fact the first problem to solve is to do some sort of squaring real quickly.
If you have figured that out on a PC, experience teaches that you're still losing a potential factor of 8,
thanks to another zillion optimizations.
You're not allowed to lose a factor of 8. The 52 GFLOPS a GPU can deliver
on paper @ 250 watt TDP (you bet it will consume that
when you let it work that hard) means the GPU effectively delivers less than 7 GFLOPS double precision, thanks to inefficient code.
Additionally, remember the P4. On paper, the claim at its release was
that it would be able to execute 4 integer instructions per
cycle; the reality is that it was a processor getting an IPC far under 1
for most integer codes. All kinds of stuff sucked on it.
Experience teaches that this is the same for today's GPUs: the scientists who have run codes on them so far, and who are really experienced CUDA programmers, figured out that the speed they deliver is a very big bummer.
Additionally, a 250 watt TDP for massive number crunching is too much.
It's well over a factor of 2 above the power consumption of a quad-core. I can soon take a look in China myself at what power prices
are over there, but I can assure you they will rise soon.
Now that's a lot less than a quad-core delivers with a TDP far under
100 watt.
Now I explicitly mention the n's I'm searching here, as it should fit within caches. So the very secret bandwidth you can practically achieve (as we know, Nvidia lobotomized the bandwidth in the GPU cards; only the Tesla type seems not to be lobotomized),
I'm not even teasing you with that.
This is true for any type of code. You're losing it to the details.
Only custom tailored solutions will work,
simply because they're factors faster.

Thanks,
Vincent

On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
Hello,
IMHO, it is better to call BLAS or a similar library rather than programming your own functions. And CUDA treats the GPU as a cluster, so .cu code does not work like our normal code. If you have a lot of matrix or vector computation, it is better to use Brook+/CAL, which can show the great power of AMD GPUs.
Regards,
Li, Bo
----- Original Message -----
From: "Mikhail Kuzminsky" <[EMAIL PROTECTED]>
To: "Vincent Diepeveen" <[EMAIL PROTECTED]>
Cc: "Beowulf" <beowulf@beowulf.org>
Sent: Wednesday, August 27, 2008 2:35 AM
Subject: Re: [Beowulf] gpgpu
In message from Vincent Diepeveen <[EMAIL PROTECTED]> (Tue, 26 Aug 2008 00:30:30 +0200):
Hi Mikhail,
I'd say they're OK for black-box 32-bit calculations that can do with
a GB or 2 of RAM;
other than that they're just luxurious electric heating.
I also want to have a simple black box, but 64-bit (Tesla C1060 or
Firestream 9170 or 9250). Unfortunately life isn't restricted to
BLAS/LAPACK/FFT :-)

So I'll need to program something else. People say that the best
choice for Nvidia is CUDA. When I look at the sgemm source, it has about 1 thousand (or more) lines in *.cu files. Therefore I think that a somewhat more difficult algorithm, such as some special matrix diagonalization,
will require a lot of programming work :-(.
It's interesting that when I read the Firestream Brook+ "kernel function"
source example for the addition of 2 vectors ("Building a High Level
Language Compiler For GPGPU",
Bixia Zheng ([EMAIL PROTECTED])
Derek Gladding ([EMAIL PROTECTED])
Micah Villmow ([EMAIL PROTECTED])
June 8th, 2008)

- it looks SIMPLE. Maybe a lot of details/source lines
were omitted from this example?
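
The kernel-function style referred to here can be imitated in plain Python. To be clear, this is not Brook+ syntax and is not taken from the cited paper; `vec_add_kernel` and `launch` are hypothetical names. It only shows why such examples look short: the kernel is written for a single element, and the runtime (here an ordinary loop) applies it over the whole index range.

```python
# Emulation of the "kernel function" idea in plain Python.
# NOT Brook+ syntax; vec_add_kernel and launch are hypothetical names.

def vec_add_kernel(i, a, b, out):
    """Body executed once per element index i."""
    out[i] = a[i] + b[i]


def launch(kernel, n, *args):
    """Stand-in for the runtime: run the kernel for every index 0..n-1.

    On a GPU these invocations would run in parallel; here they run in a loop.
    """
    for i in range(n):
        kernel(i, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)
launch(vec_add_kernel, len(a), a, b, out)
print(out)
```

Everything that `launch` hides in this sketch (moving data to and from the device, choosing how the index range is split up) is where the extra source lines tend to appear in a real program.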
Vincent
P.S. If you ask me, honestly, 250 watt or so for the latest GPU is really
too much.
250 W is the TDP; the declared average value is about 160 W. I don't
remember which GPU, from AMD or Nvidia, has a lot of special
functional units for sin/cos/exp/etc. If they are not used, maybe the
power will be a bit lower.
As for the Firestream 9250, AMD says about 150 W (although I'm not
absolutely sure that it's TDP), which is about the same as for some
Intel Xeon quad-core chips w/names beginning with X.

Mikhail
On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
BTW, why are GPGPUs considered vector systems?
Taking into account that GPGPUs contain many (identical) execution
units,
I think the model might be not SIMD but SPMD. Or does it depend on
the software tools used (CUDA etc.)?
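
To make the two terms in this question concrete, here is a small Python sketch of the distinction. It says nothing about how any particular GPU or toolkit actually executes code; the function names are hypothetical. In the SPMD version every instance runs the same program and takes its own branch; in the SIMD version one operation is applied to the whole vector, both sides of the branch are computed, and a mask selects the result per lane.

```python
# Illustration of SPMD vs SIMD handling of a data-dependent branch.
# Hypothetical names; not a model of any specific GPU.

def program(x):
    """The single program every SPMD instance runs on its own element."""
    if x < 0:
        return -x
    return x * x


def spmd_run(data):
    """SPMD: independent instances, each following its own control flow."""
    return [program(x) for x in data]


def simd_run(data):
    """SIMD: whole-vector operations in lockstep, branch resolved by a mask."""
    mask = [x < 0 for x in data]       # one compare over all lanes
    negated = [-x for x in data]       # "then" side computed for all lanes
    squared = [x * x for x in data]    # "else" side computed for all lanes
    return [n if m else s for m, n, s in zip(mask, negated, squared)]


data = [-2.0, 3.0, -0.5, 1.5]
print(spmd_run(data))
print(simd_run(data))
```

Both functions return the same values; what differs is how much work is done and how the branch is handled.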

Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf