2009/9/23 Nicolas Boulay <[email protected]>

> 2009/9/23 Kenneth Ostby <[email protected]>:
> > Nicolas Boulay:
> >>2009/9/23 Kenneth Ostby <[email protected]>:
> >>> Hi,
> >>>
> >>> Nicolas Boulay:
> >>>>2009/9/23 Hugh Fisher <[email protected]>:
> >>>>> Andre Pouliot wrote:
> >><...>
> >>>>
> >>>>Personally, LIW is what I prefer: expose every unit of the shader in
> >>>>the instruction word. Then it becomes a software challenge to optimise
> >>>>them.
> >>>
> >>> I'm unsure if LIW is a good option for this architecture. This is due to
> >>> the fact that, as Andre mentioned earlier, we have a lot of threads that
> >>> need to execute the same instruction over data in close spatial
> >>> locality. Hence, there is really no use in having fine-grained control
> >>> over the different units in a single shader, since in most cases they
> >>> are going to execute the same instruction anyway. Thus, including LIW
> >>> will only increase the complexity of the hardware, without providing any
> >>> substantial gains.
> >>>
> >>
> >>I don't understand your point. Does that mean that the ALU will be full
> >>but the other units will be unused? For example, the adder and the
> >>multiplier could be separate units, and both could be filled at the
> >>same time (a MAC instead of a MUL + an adder should be better).
> >
> > Aaah, the joy of terminology. If you take a look at the shader unit
> > figure in [1], you can see how we plan to have several ALUs in a single
> > shader. All those ALUs will execute the same instruction over
> > different threads. Thus, exposing the ALUs to the software developer
> > only adds more complexity on both the hardware and software. Furthermore,
> > the software side will in most cases only have to duplicate the same
> > instruction over several ALUs.
> >
>
> For me it's the definition of SIMD code. How do you deal with
> branches? Do you execute both branches and discard one?
> If you used masked vectors, it looks very much like the new 512-bit vector
> instructions from Intel and Larrabee (AVX?).
>

The microarchitecture looks the same as SIMD. The main difference is that
the data processed by each ALU is independent of the others.

For branch management we have so far proposed a few solutions. The first is
minimum and maximum operations that compare two values and select the
corresponding result. The other improvement is a loop unit, which should
help with small loops of fixed length. Both of those should help reduce the
number of branches. Still, if some branches do happen, there will be a
caching mechanism. The kernel will continue to execute the threads that
correspond to the first branch. Once the kernel has finished executing, it
will go back in the program to the previous branch to finish executing the
remaining threads.

> > That being said, after having finished my coffee, and had some time to
> > think, we might be able to utilize LIW, although I'm still unsure about
> > the cost to benefit ratio. Imagine if we, in what we call the ALUs,
> > include several functional units, adders, multipliers, &c. we can use
> > LIW in order to fully utilize them. However, this comes with the added
> > cost of logic, and design complexity. The simple way to solve this could
> > be to add a single multiply-adder unit inside each ALU, and thus we
> > avoid the LIW problem altogether.
> >
>
> An x86 instruction uses 2 register addresses, 1 for reading, 1 for
> read/write. It's compact but fast only with register renaming.
> A typical RISC operation is 2r1w: 3 addresses, 2 reads, one write. A MAC
> operation is 3 reads, 1 write.
>
> An LIW could be seen as a 6-read, 3-write execution unit.
>

The most frequent instruction will be one with 3 operands: 2 reads and 1
write. The instructions will be fixed length; we don't want to fall into the
trap of variable-length instructions. Also, the instructions will not
contain any constants; those will be in the constant register.

> From your terminology, it looks like an ALU with a lot of register
> ports.
> (for example a MAC/MUL unit, beside load/store, beside a complete
> ALU without MUL)
>

The ALU is split from the register file for simplicity's sake. Still, one
feature of the ALU is that it will be completely integrated: the same ALU
will do both integer and float operations. The ALU will be split into
different stages, which allows the different logic blocks to be reused. For
example, the multiplier will do the multiplication for all data types, and
it will also act as a barrel shifter. A barrel shifter is needed both for
the float add and for shifts of more than 1 bit per clock.

The only part that may not be integrated is the divider; its status is still
unclear at the moment. We know we need one, but we aren't sure how to
implement it or how the kernel will react to that instruction. We have 2
main choices: a serial divider or a pipelined divider. The serial divider's
advantage is its relatively small size; the downside is that it can only
process 1 request at a time and takes multiple cycles. With the pipelined
divider we could probably complete 1 division per cycle, but the problem
would be its size.

>
> >
> >>
> >>>>
> >>>>One other solution is having word-aligned instructions. So you could
> >>>>have 32-, 64-, or 128-bit instruction sizes.
> >>>
> >>> Before we decide on the length of the instruction, it would be fun to
> >>> further investigate some stuff from real life. And this is where we can
> >>> benefit from some of the software dudes out there. I would like to see
> >>> how big the average shader code is, compared to the available memory we
> >>> have on the underlying technology. Because, by my initial calculations
> >>> here, if we assume 32'000 instructions in a kernel (which from what I
> >>> have seen is a lot), we use about 250KB [1] to store it using 64-bit
> >>> instruction words.
> >>> That also leaves us with a lot of flexibility in the
> >>> instruction word, and the decoding should really not be that hard
> >>> either. However, depending on the underlying technology, 250KB might be
> >>> a lot of RAM.
>

The standard block RAM size for Spartan is 18,432 bits (18 Kbit). With
32-bit instructions that is 512 instructions per BRAM. For 32,000
instructions we would need 63 of them (64 if rounded up to 32,768)...

> >>
> >>I hope you could put more than a single RISC instruction in 64 bits!
> >>If you fit 3 "basic" instructions in 64 bits, you should divide your
> >>result by 3.
> >
> > Yup, I haven't been thinking a lot about how to structure the ISA yet,
> > and of course, using 64 bits for a RISC-ish ISA is a waste of space. The
> > 64 bit was just to get an example of a worst-case kernel size. However,
> > it would still be interesting to get some metric on the average shader
> > size though, so we can get a better feeling of how big real-world
> > programs are.
> >
> > [1] http://docs.google.com/View?id=dfsp4qpd_41dtrrskfb#Specification_for_Shaders_9367_2463043036062943
> >
> > --
> > Life on the earth might be expensive, but it
> > includes an annual free trip around the sun.
> >
> > Kenneth Østby
> > http://langly.org
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.9 (GNU/Linux)
> >
> > iEYEARECAAYFAkq5+jUACgkQpcFZhY+Vljx4dACfQ83XLoHPa2E4OQs3Lk+2DFC6
> > hygAmwXz76ZBT/2N591rTjzhQsISzYQw
> > =a7Nv
> > -----END PGP SIGNATURE-----
> >
>
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>
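
As a quick sanity check on the memory numbers discussed in this thread, here is a small Python sketch of the arithmetic. It is only a back-of-the-envelope estimate and rests on assumptions that are not settled in the discussion: an 18 Kbit (18,432-bit) block RAM configured as 512 x 36 with the 4 parity bits left unused, and instruction words wider than 32 bits built by placing BRAMs side by side. The function names `kernel_bytes` and `brams_needed` are made up for illustration.

```python
import math

# Assumption: one block RAM holds 18,432 bits and is used in its
# 512-deep x 36-bit-wide configuration (32 data bits + 4 parity bits).
BRAM_BITS = 18 * 1024
BRAM_DEPTH_X36 = 512
assert BRAM_DEPTH_X36 * 36 == BRAM_BITS


def kernel_bytes(instructions, word_bits):
    """Storage needed for a kernel, in bytes."""
    return instructions * word_bits // 8


def brams_needed(instructions, word_bits):
    """Number of block RAMs needed to hold the kernel.

    Assumption: each BRAM contributes 32 usable bits of width, so a
    64-bit instruction word needs two BRAMs side by side.
    """
    brams_per_word = math.ceil(word_bits / 32)
    rows = math.ceil(instructions / BRAM_DEPTH_X36)
    return rows * brams_per_word


print(kernel_bytes(32000, 64) / 1024)  # 250.0 (KB, the worst case quoted above)
print(brams_needed(32000, 32))         # 63
print(brams_needed(32000, 64))         # 126
```

Rounding the kernel up to 32,768 instructions gives 64 BRAMs for 32-bit words, which is probably the easier number to design around.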
