Jim Lux wrote:
At 12:04 AM 3/16/2006, Daniel Pfenniger wrote:
The shipment of this accelerator card has been delayed many times; the last
time I asked was October 2005. Apparently the first shipment was made this
month, for a Japanese supercomputer with 10^4 Opterons. The cost is not
indicated, but anything much above $8000 per card would put it outside
commodity hardware. I wouldn't be astonished if more performance could
be obtained in most applications with commodity clustering.
I think under $10k keeps it commodity (read: what most managers could
likely sign for themselves without needing to walk the approval ladder).
There are probably applications where a dedicated card can blow the
doors off a collection of PCs. At some point, the interprocessor
communication latency inherent in any sort of cabling between processors
would start to dominate.
There are numerous such examples in the life sciences, in chemistry, and
other areas. Such cards are not universal; they cannot be viewed as
general-purpose processors. You have to view them as dedicated attached
processors.
The Clearspeed cards carry two of their co-processors, each with 96 FP
units; I believe the architecture is a systolic array. To program them
at a high level you have a C variant you can use, or you can hand-code
assembly. The latter is hard.
The issue for these cards is memory bandwidth in and out of the
PCI-X based interface. There are tricks you can play in a well-designed
system, but you cannot escape the bandwidth ceiling of PCI-X.
For many algorithms of potential interest to this list, memory bandwidth
is as important as FP performance. Having effectively 100 processors on
the far side of a narrow pipe means you have to design your algorithms
with that pipe width in mind.
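To put rough numbers on that (these are illustrative assumptions, not
measured Clearspeed figures): assume ~1 GB/s peak for 64-bit/133 MHz PCI-X
and a nominal 50 GFLOP/s for the card, and you can work out how much
arithmetic per byte an algorithm has to do before the bus stops being the
bottleneck. Something like:

    /* Back-of-the-envelope: how much arithmetic per byte must an
     * algorithm do before an attached accelerator behind PCI-X stops
     * being bandwidth-starved?  All figures below are assumptions for
     * illustration, not Clearspeed specs. */
    #include <stdio.h>

    int main(void)
    {
        double pci_x_bw   = 1.06e9;  /* assumed: 64-bit/133 MHz PCI-X peak, bytes/s */
        double card_flops = 50.0e9;  /* assumed: nominal FLOP/s for the card        */

        /* FLOPs the card must perform per byte moved across the bus
         * just to keep its FP units busy. */
        double flops_per_byte = card_flops / pci_x_bw;

        /* Example: dense matrix multiply on n x n doubles moves roughly
         * 3*n*n*8 bytes and does 2*n^3 FLOPs, so its intensity grows
         * with n -- the kind of kernel that survives the narrow pipe. */
        int n = 2000;
        double dgemm_intensity = (2.0 * n * n * n) / (3.0 * n * n * 8.0);

        printf("need ~%.0f FLOPs per byte to hide the bus\n", flops_per_byte);
        printf("dgemm at n=%d delivers ~%.0f FLOPs per byte\n", n, dgemm_intensity);
        return 0;
    }

Low-intensity kernels (sparse matrix-vector products, streaming filters)
never get near that ratio, which is part of why these look like dedicated
attached processors rather than general-purpose ones.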
If Clearspeed considered mass production at a cost of, say, $100-$500
per card, the market would be huge, because the card would then be
competing with multi-core processors like the IBM-Sony Cell.
Kahan had some interesting things to say about the Cell, summarized
roughly as: with Cell you get to pick one of fast or accurate. He was
making this point in general but pointed out some specific issues; this
is from a talk on his web site. Caveat: I don't have a Cell to play with
(yes Santa, I would like one or two hundred), so I can't run paranoia or
other fun tests.
You need "really big" volumes to get there. Retail pricing of $200
implies a bill-of-materials cost down in the sub-$20 range.
Yup. Volume drives lower pricing; economies of scale matter. This is
why FPGAs are priced where they are: they don't have large volumes.
If they did, pricing would be better.
Considering that a run-of-the-mill ASIC spin costs >$1M (for a small
number of parts produced), your volume has to be several hundred thousand
(or a million) before you even cover the cost of your development.
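Just to make that arithmetic explicit (the margin figure is purely an
assumption for illustration):

    /* Break-even volume for an ASIC spin.  The NRE and margin figures
     * are illustrative assumptions, not anyone's actual costs. */
    #include <stdio.h>

    int main(void)
    {
        double nre             = 1.0e6;  /* assumed ASIC spin cost, dollars   */
        double margin_per_unit = 5.0;    /* assumed profit per card after BOM */

        /* $1M / $5 per unit = 200,000 units -- "several hundred thousand"
         * before development is even paid off. */
        printf("break-even volume: %.0f units\n", nre / margin_per_unit);
        return 0;
    }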
The video card folks can do this because
a) each successive generation of cards is derived from the past, so the
NRE is lower; most of the card (and IC) is the same
I believe they are in incremental improvement mode. This keeps redesign
costs way down.
b) they have truly gargantuan volumes
This is the critical thing. Remember, these are highly pipelined
graphical supercomputers. The ClawHMMer project ran a hardware-accelerated
HMMer on an nVidia GT6800 5x faster than the P4 hosting the card.
c) they have sales from existing products to provide cash to support the
development of version N+1.
Cash is king.
{I leave aside the possibility of magic elves, although with some
consumer products, I have no idea how they can design, produce, and sell
it at the price they do. Making use of relative currency values can
also help, but that's in the non-technological magic elf category, as
far as I'm concerned.}
Actually lots of stuff is done outside the US these days. Not magic
elves per se, but Indian and Chinese engineers and scientists who are
extremely good at what they do. This starts getting into a cost and
productivity discussion rather rapidly.
The most interesting niche for the Clearspeed cards, it appears to me,
is accelerating proprietary applications like Matlab, Mathematica and
particularly Excel, which run on a single PC and can hardly be
reprogrammed by their users to run on a distributed cluster.
I would say that there is more potential for a clever soul to reprogram
the guts of Matlab, etc., to transparently share the work across
multiple machines. I think that's in the back of the mind of MS, as
they move toward a services environment and .NET
:)
So imagine, if you will, an LD_PRELOAD environment variable which
points a user's code over to the relevant libraries, which work their
magic behind the scenes. I would be hard pressed to imagine using this
for Excel, but could see it for Matlab: programming at high levels with
high performance. Of course Kahan also rips into them over accuracy ...
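A minimal sketch of that kind of interposition, assuming the application
resolves the standard Fortran BLAS symbol dgemm_ at run time;
accelerated_dgemm below is a hypothetical stand-in for whatever offload
routine a vendor library would actually provide:

    /* Minimal LD_PRELOAD interposer sketch: catch the application's
     * dgemm_ calls and decide whether to ship them to an accelerator
     * or fall through to the host BLAS.  "accelerated_dgemm" is a
     * hypothetical stand-in for a vendor offload routine. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>

    typedef void (*dgemm_fn)(const char *transa, const char *transb,
                             const int *m, const int *n, const int *k,
                             const double *alpha, const double *a, const int *lda,
                             const double *b, const int *ldb,
                             const double *beta, double *c, const int *ldc);

    void dgemm_(const char *transa, const char *transb,
                const int *m, const int *n, const int *k,
                const double *alpha, const double *a, const int *lda,
                const double *b, const int *ldb,
                const double *beta, double *c, const int *ldc)
    {
        static dgemm_fn host_dgemm = NULL;
        if (!host_dgemm)                 /* look up the real BLAS symbol once */
            host_dgemm = (dgemm_fn)dlsym(RTLD_NEXT, "dgemm_");

        /* Only large problems amortize the trip across the bus; small ones
         * stay on the host.  The 512 cutoff is an arbitrary example. */
        if (*m >= 512 && *n >= 512 && *k >= 512) {
            /* accelerated_dgemm(transa, transb, m, n, k, alpha,
             *                   a, lda, b, ldb, beta, c, ldc);  (hypothetical) */
            host_dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
        } else {
            host_dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
        }
    }

Build it with something like "gcc -shared -fPIC -o libinterpose.so
interpose.c -ldl", set LD_PRELOAD=./libinterpose.so before launching
Matlab, and the application never knows the big multiplies went somewhere
else. Excel is harder, as noted above, not least because Windows has no
LD_PRELOAD.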
Jim
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf