2009/9/14 Timothy Normand Miller <[email protected]>:
> On Mon, Sep 14, 2009 at 5:09 PM, Nicolas Boulay
> <[email protected]> wrote:
>
>>> The problem is that memory accesses won't stream in ascending order or
>>> even necessarily stay on the same memory row before moving on to
>>> another memory row (and precharge+activate is expensive). If we could
>>> automatically divide the dataflow graph in a shader program between
>>> requestor and consumer, we could absorb the read latency and keep
>>> accesses nicely ordered. The problem is program flows that depend on
>>> data. Those aren't that common (in graphics, although they are in
>>> supercomputing). If it's a simple if/then, then we can use
>>> speculative execution. But if some piece of read data is used as an
>>> address (for instance), then that creates a cycle in the dataflow
>>> graph, which is hard to deal with in some cases.
>>
>> Usually, 3D chips try to use tiling techniques to reduce memory
>> bandwidth to the minimum.
>
> I think they call this "sort first". "sort last" is the way we're
> accustomed to where we render to arbitrary parts of the screen.
>
>>>> Larrabee uses a
>>>> specific instruction for classical rasterisation.
>>>
>>> I read a paper on Larrabee rasterization where they discussed the
>>> challenges faced by the fact that they did NOT have special
>>> instructions just for rasterization.
>>>
>>
>> So they use a core beside the CPU?
>
> No. The only special hardware they have is texture decompression.

So it's a kind of load/store unit on steroids? That's a must to avoid
pointless bit moves.

> Everything else is using their vector instruction set. So they came
> up with a clever way to do rasterization using their vector
> instruction set while not adding any instructions _specific_ to
> rasterization. That is, they were able to do a good job using just
> general purpose vector instructions.
>
>>>> In that case, the software should be written at the same time as the
>>>> hardware to find the easiest way to achieve the maximum usable
>>>> performance. The usual shaders are not the easiest path!
>>>
>>> That's the beauty of prototyping in an FPGA. :)
>>>
>>
>> I think that the challenge is not in the performance of the chip but
>> in how easy it is to use.
>>
>
> I'm tempted to make an argument that pushes off complexity into the
> compiler. But we've seen the negative effects of that a thousand
> times over (e.g. Itanium, Common Lisp).
>

Itanium uses 2 or 3 bundles of 3 instructions. That's a lot compared to
the average IPC of 3 for typical code. A LIW with 2 or 3 instructions
should not have the same issue. It could be a typical RISC pipeline
alongside the FPU pipeline. The idea is to have the FPU used every cycle.

If you use massive SMT techniques, aren't you afraid of having more
muxes than FPUs?

Nicolas

> --
> Timothy Normand Miller
> http://www.cse.ohio-state.edu/~millerti
> Open Graphics Project
>
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
