> On 4/20/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>
>> > On 4/20/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> >
>> >> I don't think it's wise to use the SIMD ALU here. All scalar code will
>> >> use the SIMD FPU with 3 FMUL units idle. Because everything is strongly
>> >> parallel, I think it's better to stay scalar.
>> >
>> > If there are enough independent scalars that can be scheduled, you can
>> > pack them and run them in parallel.
>>
>> So you need the logic to detect that a pack is possible, and you need the
>> switch that permits connecting the different register banks to the FPUs.
>
> No. The compiler can optimize this stuff.
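To make the disagreement concrete: the claim is that a compiler can pack four independent scalar multiplies into one SIMD operation. A minimal sketch with x86 SSE intrinsics (an assumption for illustration; the shader ISA under discussion is different) shows both the win and the pack cost:

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* One SIMD FMUL replaces four independent scalar FMULs -- but only
 * after the compiler has proved the four multiplies are independent,
 * and only after paying for the loads/shuffles that gather the
 * operands into one 128-bit register (the "pack" cost above). */
static void mul4_packed(const float a[4], const float b[4], float r[4])
{
    __m128 va = _mm_loadu_ps(a);          /* pack: 4 floats in one load */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_mul_ps(va, vb)); /* one vector multiply */
}
```

If the four scalars do not already sit contiguously in memory or in one register, the gather/scatter shuffles around the single `_mm_mul_ps` can eat the gain, which is the point being argued above.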
So you want to pack and unpack SIMD registers. That's a cutting-edge technique that very few use in normal computing (with SSE2, AltiVec, ...), and compilers are quite bad at it. And if you optimise the pack/unpack instructions in hardware, that means switches that are big and slow. FMUL is a one-cycle operation; if you need 3 pack instructions before it, there is no point.

> Have a look at the instruction set you came up with; 19 out of 27 ops are
> vector ops. It is going to be far more important to optimize vector ops
> than to ensure full utilisation of the silicon at all times.

Look at the real code posted here! Most operations are purely scalar.

> If you can do a dot product in 10 instructions (8 load, 1 vmul, 1 store)
> then that is a big gain over 17 instructions (8 load, 4 smul, 3 add, 1
> store). If you have a wide memory bus and can fetch four floats in one op,
> so much the better; now we are down to 4 instructions instead of 17.
> Vector operations are very likely to dominate a shader (any 3D processing
> for that matter) therefore the whole architecture should have the goal of
> optimising vector operations.

If I have understood shaders correctly, loads are only for textures; everything else is transmitted through specific registers, so those loads are implicit. Basically, a DOT product takes one cycle on a vector architecture and 4 on a scalar LIW architecture (as André Pouliot and I explained), provided you can interleave the instructions correctly (7 instructions of latency otherwise, with a 3-cycle-latency FPU). So for DOT it's roughly the same. And because a scalar CPU will be almost 4 times smaller than a vector shader, you could put 4 scalar cores where you put 1 vector core.

When you look at the compiled code posted here, you see a lot of MOV, scalar MUL, etc. In this precise case, the SIMD unit is completely underused. I don't have access here to the ASM that was published, but this is the kind of code to optimise.
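For reference, the "4 smul, 3 add" scalar count being traded against one vmul corresponds to code of this shape (a plain-C sketch, not the actual shader code posted on the list):

```c
/* Scalar 4-element dot product: 4 MULs + 3 ADDs, the "17 instruction"
 * case once the 8 loads and 1 store are counted around it. On a
 * 3-cycle-latency FPU the three adds form a dependency chain; a
 * scalar LIW core hides that latency only if it can interleave
 * other independent dot products in between -- which is where the
 * "4 cycles if interleaved, 7 otherwise" figures above come from. */
static float dot4_scalar(const float a[4], const float b[4])
{
    return a[0] * b[0]
         + a[1] * b[1]
         + a[2] * b[2]
         + a[3] * b[3];
}
```

A vector unit collapses this to a single DOT/vmul op per result; the argument above is that on real shader code, with its many scalar MOVs and MULs, that wide unit then sits mostly idle.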
I know that if only vector code is used, a vector shader will be faster (because there are fewer read-after-write dependency problems than on a scalar CPU), but shaders use a lot of scalar code.

> Tom
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
