On Wed, Sep 23, 2009 at 3:30 AM, Nicolas Boulay <[email protected]> wrote:
> 2009/9/23 Hugh Fisher <[email protected]>:
>> Andre Pouliot wrote:
>>>
>>> Yes, all the vector ops will be emitted as scalar ops. The program does
>>> get longer, but scalar ops can be made shorter since we have fewer
>>> instructions to support.
>>
>> SIMD instructions are the same length as scalar instructions on any
>> sensible CPU: see MIPS and PowerPC. Four scalar MADD instructions
>> are going to be four times longer than the equivalent vector MADD.
>>
>> (Uh, you *are* intending to use fixed length instructions, right?
>> Please tell me you're not thinking of variable length opcodes?)
>>
>
> Personally, LIW is what I prefer: expose every unit of the shader in
> the instruction word. Then it becomes a software challenge to optimise
> them.

Among other things, we would have the following ALU units:

- int add
- int mul
- combined div
- fp mul
- fp add
- memory load/store
- type conversions
- flow control (really part of the fetch/decode)

Let's say we made these four slots in the instruction word:

- add (int or fp)
- mul
- flow control
- other (memory, convert, div)

We could in theory keep more units in the ALU busy and save a cycle on
flow control. But how often would this be the case? I expect we'd
very often fill most of the slots with NOOPs. This is a common
challenge with VLIW architectures. (We like to call Itanium VLIW, but
it's not. It's EPIC with several templates for how arbitrary
instructions can be packed into a 128-bit word, with the unit of
parallelism being arbitrarily long.)

>
> One other solution is having word-aligned instructions. So you could
> have 32-, 64-, and 128-bit instruction sizes.
>
> If instruction code size matters, you could use such a trick to reduce
> the code size. A large instruction word is a must to have constants
> embedded in the code and to save bandwidth for the data.
>
> If you could use 64 bits to squeeze in an LIW instruction word that
> enables the use of 3 units at the same time, you could have more
> compact code than using typical 32-bit RISC instructions.

It's clear to me that if we're to make any such optimizations, the
FIRST one would be vector instructions (even if the decode just
converts them into a sequence of scalars). We know that many or most
shader kernels are vector-heavy.

>
> <...>
>>
>>> Those optimizations were to improve 3D rendering and scientific
>>> processing on a general purpose processor. You don't have the same
>>> requirements and workload as a GPU. Different problems and contexts
>>> require different solutions.
>>
>> It's exactly the same requirements and workload! 3D vertices have to be
>> multiplied by a 4x4 transform matrix. Doesn't matter whether it's on the
>> CPU or GPU.
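
To make the trade-off above concrete, here is a rough Python sketch of a
4x4 matrix times vec4 transform written as four vector instructions,
decoded into scalar ops, and then packed into the four-slot instruction
word (add, mul, flow control, other). Everything in it is an assumption
made for illustration and not part of any agreed design: the `expand`
and `pack` helpers, the register names (`r`, `t`, `m0`..`m3`, `v`), the
column-major layout, the split of a MADD into a separate mul and add,
and the greedy in-order packing rule.

```python
from typing import NamedTuple

SLOTS = ("add", "mul", "flow", "other")  # the four slots from the message
COMPONENTS = "xyzw"


class Op(NamedTuple):
    unit: str    # slot the scalar op needs: "add" or "mul"
    dst: str     # destination register, e.g. "r.x"
    srcs: tuple  # source registers


def expand(vec_op, dst, col, scalar):
    """Hypothetical decode step: turn one vec4 MUL/MADD into scalar ops.

    MUL : dst.c = col.c * scalar
    MADD: dst.c = col.c * scalar + dst.c  (assumed to split into mul + add,
          because the unit list has separate fp mul and fp add units)
    """
    ops = []
    for c in COMPONENTS:
        if vec_op == "MUL":
            ops.append(Op("mul", f"{dst}.{c}", (f"{col}.{c}", scalar)))
        else:
            ops.append(Op("mul", f"t.{c}", (f"{col}.{c}", scalar)))
            ops.append(Op("add", f"{dst}.{c}", (f"t.{c}", f"{dst}.{c}")))
    return ops


def pack(ops):
    """Greedy in-order packing: start a new word when the slot is taken
    or the op touches a register written earlier in the same word."""
    words, current, written = [], {}, set()
    for op in ops:
        hazard = op.dst in written or any(s in written for s in op.srcs)
        if op.unit in current or hazard:
            words.append(current)
            current, written = {}, set()
        current[op.unit] = op
        written.add(op.dst)
    if current:
        words.append(current)
    return words


# r = M * v as four vector instructions (matrix columns m0..m3).
vector_program = [
    ("MUL", "r", "m0", "v.x"),
    ("MADD", "r", "m1", "v.y"),
    ("MADD", "r", "m2", "v.z"),
    ("MADD", "r", "m3", "v.w"),
]

scalar_ops = [op for instr in vector_program for op in expand(*instr)]
words = pack(scalar_ops)
used = sum(len(w) for w in words)
total = len(words) * len(SLOTS)

print(f"vector instructions: {len(vector_program)}")
print(f"scalar ops after decode: {len(scalar_ops)}")
print(f"instruction words: {len(words)}, slots filled: {used}/{total}")
```

Under these assumptions the 4 vector instructions decode to 28 scalar
ops, which the greedy packer fits into 17 instruction words with 28 of
68 slots filled. The flow-control and "other" slots are never used by
this kernel, so more than half of every word is NOOPs no matter how well
the add and mul slots are scheduled. A smarter scheduler or a fused
scalar MADD would change the exact numbers.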
>>
>
> CPUs don't run thousands of threads at the same time; that's the main
> difference. It's far easier to get 1 Tflops with many threads on many
> cores (with a very relaxed memory coherency model) than with 2 or 4
> cores.
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
