2009/9/23 Andre Pouliot <[email protected]>:
>
> 2009/9/23 Nicolas Boulay <[email protected]>
>>
>> 2009/9/23 Kenneth Ostby <[email protected]>:
>> > Nicolas Boulay:
>> >>2009/9/23 Kenneth Ostby <[email protected]>:
>> >>> Hi,
>> >>>
>> >>> Nicolas Boulay:
>> >>>>2009/9/23 Hugh Fisher <[email protected]>:
>> >>>>> Andre Pouliot wrote:
>> >><...>
>> >>
>> >>I don't understand your point. Does that mean the ALU will be full
>> >>while the other units sit unused? For example, the adder and the
>> >>multiplier could be separate units, and both could be kept busy at
>> >>the same time (a MAC instead of a MUL + an adder would be better).
>> >
>> > Aaah, the joy of terminology. If you take a look at the shader unit
>> > figure in [1], you can see how we plan to have several ALUs in a single
>> > shader. All those ALUs will execute the same instruction over
>> > different threads. Thus, exposing the ALUs to the software developer
>> > only adds more complexity on both the hardware and software side.
>> > Furthermore, the software side would in most cases only have to
>> > duplicate the same instruction over several ALUs.
>> >
>>
>> For me that's the definition of SIMD code. How do you deal with
>> branches? Do you execute both branches and discard one? If you
>> use masked vectors, it looks very much like the new 512-bit vector
>> instructions from Intel for Larrabee (AVX?).
>
>
> The microarchitecture looks the same as SIMD. The main difference is that
> the data being processed by each ALU is independent from the others.
>
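To make the "execute both branches and discard one" question concrete, here is a toy Python model of lock-step lanes running both sides of a branch under a per-lane mask. It is only a sketch of the general masking idea, not the OGP design: the function name, the lane count and the example operations are all made up for illustration.

```python
# Toy model: every lane runs BOTH sides of a branch, a per-lane mask
# selects which result is kept. Names are hypothetical, for illustration only.

def run_masked_branch(values, cond, then_op, else_op):
    """Run both branch sides on all lanes, then select per lane with a mask."""
    mask = [cond(v) for v in values]          # one predicate bit per lane
    then_out = [then_op(v) for v in values]   # pass 1: all lanes run the 'then' side
    else_out = [else_op(v) for v in values]   # pass 2: all lanes run the 'else' side
    # Lane-wise select: the result of the side not taken is simply discarded.
    return [t if m else e for m, t, e in zip(mask, then_out, else_out)]

# Four lanes, each holding independent data, all executing the same code.
lanes = [-2.0, 0.5, 3.0, -0.25]
result = run_masked_branch(
    lanes,
    cond=lambda v: v < 0.0,      # if (v < 0)
    then_op=lambda v: -v,        #     v = -v
    else_op=lambda v: v * 2.0,   # else v = v * 2
)
print(result)  # [2.0, 1.0, 6.0, 0.25]
```

The cost is visible in the model: both passes always run, so a divergent branch takes the time of both sides added together, which is why reducing the number of branches matters.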
What you describe is very much like Larrabee's SIMD, which uses
vectorised load/store (a different address for each vector element).

For more information:
http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf p. 51

> For branch management we have so far proposed a few solutions:
> minimum and maximum operations that compare two values and select the
> corresponding result. The other improvement is a loop unit, which should
> help with small loops of a fixed length.
> Both of those should help reduce the number of branches. Still, if some do
> occur, there will be a caching mechanism. The kernel will continue to
> execute the threads that take the first branch. Once it has finished
> executing them, the kernel will go back to the previous branch point in
> the program to finish executing the remaining threads.
>

So you want some vectorised conditional moves and some "conditional
vector masks", like Intel has for Larrabee :) (it's a kind of
predication) (VPU is Intel's name for the vector unit behind this
instruction set).

If you add loop control inside the ALU, it becomes hard to program.
Do you want a kind of "repeat n times" for an instruction, like in a
DSP (but DSPs have memory-to-ALU addressing modes)? That gets complex.

>>
>> > That being said, after having finished my coffee, and had some time to
>> > think, we might be able to utilize LIW, although I'm still unsure about
>> > the cost to benefit ratio. Imagine if we, in what we call the ALUs,
>> > include several functional units (adders, multipliers, &c.); we can
>> > then use LIW in order to fully utilize them. However, this comes with
>> > the added cost of logic and design complexity. The simple way to solve
>> > this could be to add a single multiply-adder unit inside each ALU, and
>> > thus avoid the LIW problem altogether.
>> >
>>
>> An x86 instruction uses 2 register addresses: 1 for reading, 1 for
>> read/write. It's compact, but fast only with register renaming.
>> A typical RISC operation is 2r1w: 3 addresses, 2 reads, 1 write.
>> A MAC operation is 3 reads, 1 write (3r1w).
>>
>> An LIW could be seen as a 6-read, 3-write execution unit.
>
> The most frequent instruction will be one with 3 operands: 2 reads and 1
> write. The instructions will be fixed length; we don't want to fall into
> the trap of variable-length instructions. Also, the instructions will not
> contain any constants; those will be in the constant registers.

Many years ago, some statistics were posted on this list about the use
of combined instructions. The most common pairs were "ADD ADD" and
"ADD MUL". So 3r1w should be used, at least for ADDMUL (a*b+c);
ADDADD (a+b+c) could also be used often. A 3-input adder should be
smaller than 2 simple adder units, shouldn't it?

>>
>> From your terminology, it looks like an ALU with a lot of register
>> ports (for example a MAC/MUL unit, beside load/store, beside a complete
>> ALU without MUL).
>
>
> The ALU is split from the register file for simplicity's sake. Still, one
> feature of the ALU is that it will be completely integrated. What I mean
> is that the same ALU will do integer and float operations. That ALU will
> be split into different stages, which will allow reuse of the different
> logic blocks. For example, the multiplier will do the multiplication for
> all data types and will also act as a barrel shifter. A barrel shifter is
> needed both for the float add and if we want to do shifts of more than
> 1 bit per clock.

It's hard to know whether it's better to have one more ALU, or a split
ALU that is more complex but has a better utilization rate.

> The only part that may not be integrated is the divider; its status is
> still unclear at the moment. We know we need one, but we aren't sure how
> to implement it or how the kernel will react to that instruction. We
> have 2 main choices: either a serial divider or a pipelined divider. The
> serial divider's advantage is its relatively small size; the downside is
> that it can only process 1 request at a time and takes multiple cycles.
> The pipelined divider's advantage is that we could probably do 1 division
> per cycle, but the problem would be the size requirement.

You should only implement 1-cycle operations. If you really need div,
pipeline (1/x) followed by a MUL, with enough guard bits to get the
required precision. There are lots of 1-cycle operations for complex
functions (1/x, 1/sqrt(x)).
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
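PS: on the "pipeline (1/x) followed by a MUL" suggestion, here is a rough Python sketch of one way it could work. It uses Newton-Raphson refinement, which is my own choice for the illustration and not something the thread settled on. The function names, the step count and the initial-guess constants are assumptions; a real unit would use fixed-point arithmetic with guard bits and probably a small lookup table for the first guess.

```python
import math

def reciprocal(x, steps=4):
    """Approximate 1/x using only multiplies and subtracts (no divide unit)."""
    if x == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    # abs(x) = m * 2**e with 0.5 <= m < 1; in hardware this is just the
    # mantissa/exponent split of the float format.
    m, e = math.frexp(abs(x))
    # Linear first guess for 1/m on [0.5, 1); the two constants would be
    # precomputed, so no division happens at run time.
    y = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(steps):        # each step could be one pipeline stage
        y = y * (2.0 - m * y)     # Newton-Raphson: error roughly squares per step
    # Undo the exponent scaling and restore the sign.
    return math.copysign(math.ldexp(y, -e), x)

def divide(a, b):
    """a / b as one multiply by the approximated reciprocal."""
    return a * reciprocal(b)

print(divide(7.0, 3.0))
```

Each refinement step is 2 multiplies and 1 subtract, so the stages can reuse the same kind of multiplier the ALU already needs. The result is not guaranteed to be correctly rounded, which is where the guard bits come in.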
