On Fri, Nov 30, 2012 at 01:57:27PM -0500, Mark Hahn wrote:

> well, cuda exposes a quite different programming model;
> I'm not sure they can do much about the generational differences
> they expose. (say, generational differences in what kind of atomic
> operations are supported by the hardware.) many of the exposures
> are primarily tuning knobs (warp width, number of SMs, cache sizes,
> ratio of DP units per thread.) a very high-level interface like OpenACC
> doesn't expose that stuff - is that more what you're looking for?
> there's no denying that you get far less expressive power...
Absolutely. CUDA is a lot like assembler that way, and assembler has
been almost completely displaced by low-level but hardware-independent
languages like C. You can't tune as much in OpenCL, but on the other
hand, you don't have to. The achievable performance is lower, but more
uniform across diverse platforms. The JIT knows the hardware, so you
don't have to. Parallelism is hard enough as it is, so there's nothing
wrong with a little set of training wheels.

>>> stacking is great, but not that much different from MCMs, is it?
>>
>> Real memory stacking a la TSV has smaller geometries, way more
>> wire density, lower power burn, and seems to boost memory bandwidth
>> by one order of magnitude
>
> sorry, do you have some reference for this? what I'm reading is that

Not really. I'm increasingly out of my depth here, and happy to be able
to learn from you. I have theoretical reasons to believe that TSV is
the next best thing to real 3d integration, though already off-Moore,
since it's assembled from discrete components, which themselves are
on-Moore (but Moore has recently ended, anyway). This is corroborated
by ad hoc Googling, so I'm happy if we'll be able to get to those
promised 10^5 via density/die eventually. There's really no other way
to feed the kilocores/die we'll be getting other than by a very wide
bus to memory. Eventually, if MRAM can be deposited directly on top of
logic, you'll effectively have a >10^9-wide bus to your memory.

> TSV and chip-on-chip stacking is fine, but not dramatically different
> from chip-bumps (possibly using TSV) connecting to interposer boards.

The A6 has about 8.5 GByte/s memory bandwidth, while Micron has
demonstrated 128 GByte/s with stacked memory. That will already feed a
reasonably powerful GPU, so it should be more than enough for these
ARM GPUs.

> obviously, attaching chips to fine, tiny, low-impedance, wide-bus
> interposers gives a lot of flexibility in designing packages.
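A quick back-of-envelope check of the bandwidth numbers above, to show
why kilocore dies need stacked memory. The core count, clock, and
bytes-per-flop figures are illustrative assumptions on my part, not
measurements; only the 8.5 and 128 GByte/s come from the thread:

```python
# Rough feasibility check: memory bandwidth a kilocore die would need,
# versus a conventional mobile bus (A6) and Micron's demonstrated
# stacked-DRAM figure. CORES, FLOPS_PER_CORE, and BYTES_PER_FLOP are
# assumed, illustrative values.

CORES = 1024            # a "kilocore" die
FLOPS_PER_CORE = 2e9    # assume 1 GHz, 2 flops/cycle per core
BYTES_PER_FLOP = 0.1    # assumed arithmetic intensity of the workload

needed = CORES * FLOPS_PER_CORE * BYTES_PER_FLOP   # bytes/s required

A6_BW = 8.5e9           # A6 memory bandwidth (from the thread)
HMC_BW = 128e9          # Micron's demonstrated stacked bandwidth

print("needed: %.1f GB/s" % (needed / 1e9))                 # 204.8 GB/s
print("A6 covers %.1f%% of that" % (100 * A6_BW / needed))  # ~4.2%
print("HMC covers %.1f%% of that" % (100 * HMC_BW / needed))# ~62.5%
```

Even with these mild assumptions, a conventional bus covers only a few
percent of the demand; the stacked part gets within striking distance.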
> >> http://nepp.nasa.gov/workshops/etw2012/talks/Tuesday/T08_Dillon_Through_Silicon_Via.pdf
>
> that's useful, thanks. it's a bit high-end-centric - no offence, but
> NASA and high-volume mass production are not entirely aligned ;)
>
> it paints 2.5d as quite ersatz, but I didn't see a strong data argument.
> sure, TSVs will operate on a finer pitch than solder bumps, but the
> xilinx silicon interposer also seems very attractive. do you actually
> get significant power/speed benefits from pure chip-chip
> contacts versus an interposer? I guess not: that the main win is
> staying in-package.

I've seen some ~um-scale polymer fiber which can easily link adjacent
~10 um spaced dies at >5 TBit/s, so there's plenty of air in the
interconnect space still.

http://www.heise.de/newsticker/meldung/Optische-Chip-zu-Chip-Verbindung-mit-Polymerfasern-1715192.html

> it is interesting to think, though: if you can connect chips with
> extremely wide links, does that change your architecture? for instance,
> dram is structured as 2d array of bit cells that are read out into a 1d
> slice (iirc, something like 8kbits). cpu r/w requests are satisfied
> from within this slice faster since it's the readout from 2d that's
> expensive. but suppose a readout pumped all 8kb to the cpu - sort of a
> cache line 16x longer than usual. considering the proliferation
> of 128-512b-wide SIMD units, maybe this makes perfect sense. this would
> let you keep vector fetches from flushing all the non-vector stuff out
> of your normal short-line caches...

A long time ago, when I was young and even more stupid than today, I
wrote http://www.enlight.ru/docs/arch/uliw.txt which I think has aged
quite well.
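The whole-row-readout idea quoted above can be put in numbers with a
toy model: stream a vector through a cache with conventional 64 B lines
versus treating the full 8 kbit (1 KiB) DRAM row as one long line, and
count the fills. The line sizes come from the thread; the model itself
is my simplification (sequential stream only, no associativity, no
writeback):

```python
# Toy model of the long-cache-line idea: count line fills (a proxy for
# DRAM row activations) for a sequential stream of 8-byte vector loads,
# with conventional 64 B lines versus one 8 kbit (1 KiB) row per line.
# Simplified on purpose: sequential access, no associativity/writeback.

def fills(stream_bytes, line_bytes, access_bytes=8):
    """Number of line fills for a sequential stream of 8-byte loads."""
    misses = 0
    current_line = None
    for addr in range(0, stream_bytes, access_bytes):
        line = addr // line_bytes
        if line != current_line:      # first touch of a new line
            misses += 1
            current_line = line
    return misses

STREAM = 1 << 20                      # a 1 MiB vector
short = fills(STREAM, 64)             # conventional 64 B lines
long_ = fills(STREAM, 1024)           # full 8 kbit row as one line

print(short, long_, short // long_)   # -> 16384 1024 16
```

One row activation serves 16x as many accesses, which is exactly the
"cache line 16x longer than usual" factor in the quoted text; the win
only materializes for streaming access, which is why keeping the
short-line cache for non-vector traffic matters.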
_______________________________________________ Beowulf mailing list, [email protected] sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
