> > We don't. However, we could use software (the compiler/assembler) to do it.

So you need pack/unpack instructions, and you must balance their use, because each one also takes a cycle. Or you need individually selectable words inside the SIMD register, which produces a big switch.

>> A vertex/fragment shader must be applied to each vertex/fragment
>> individually. So you can duplicate it as many times as you want. The
>> problem is very different from a CPU. CPUs are mainly designed to
>> speed up the processing of a single thread.
>>
>> Here, you want to optimise the performance/cost ratio. In hardware,
>> cost == area. A shader is mainly MAC operations, and a MAC unit is a
>> big unit.
>
> Yes. Since much of the Shader math is going to be 4x4 matrix
> multiplies, that is what needs to be optimized.

No, it's not. Read the code. There are a lot of DOT products (a vector operation, not a matrix one) and a lot of scalar MULs.

>> So you must have the highest activity possible on each FPU. In the
>> best case, useful values must be produced by each FPU on every
>> clock cycle.
>
> Yes, that is what will happen when doing 4x4 matrix multiplies. With 4
> pipelined Multiply Accumulate units, you will get 4 32 bit float outputs
> every 4 machine cycles if the pipeline is kept full. This is the best
> you can do without more hardware.
>

A program will not contain only vector MUL instructions.

>> With scalar code, this activity could be very high. With SIMD
>> hardware and scalar code, the packing needs a lot of effort. And I
>> don't think you could reach the activity of the scalar version.
>
> If the compiler does the packing, you will get the maximum possible.
> However I doubt that there will be enough to fill all 4 channels except
> when doing a 4x4 matrix multiply. But, you don't want to slow things
> down just to have higher ALU utilization. Yes, some of the hardware
> will be idle when the operations are other than 4x4 matrix multiplies,
> but I don't see that this is a problem.

It's a problem if you could use the 4 FPUs in a different way that would give you more performance.
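
To put rough numbers on the utilization argument, here is a minimal sketch of a cycle-count model. It is only an illustration under loud assumptions: every operation takes one issue slot, the pipeline never stalls, pack/unpack cost is ignored (which favours the SIMD side), and the instruction mix is made up, not measured from any real shader. The function names are hypothetical too. Each entry in `widths` is the number of useful lanes in one shader instruction (4 for a full vec4 op, 3 for the multiplies of a 3-component dot product, 1 for a scalar MUL).

```python
def simd_utilization(widths, lanes=4):
    """Fraction of FPU slots doing useful work on one `lanes`-wide SIMD unit."""
    return sum(widths) / (lanes * len(widths))


def simd_cycles(widths, fragments=4):
    """Cycles for one SIMD unit to shade `fragments` fragments, one after another.

    Every instruction occupies one issue slot, however many lanes are useful.
    """
    return len(widths) * fragments


def scalar_cycles(widths, cores=4, fragments=4):
    """Cycles for `cores` scalar cores, each shading its own fragment.

    A w-wide instruction becomes w scalar operations on one core.
    """
    per_fragment = sum(widths)
    batches = -(-fragments // cores)  # ceiling division
    return per_fragment * batches


if __name__ == "__main__":
    # Hypothetical mix: one vec4 op, one dot3, three scalar MULs, one 2-wide op.
    mixed = [4, 3, 1, 1, 1, 2]
    # Pure vec4 code, as in a 4x4 matrix multiply.
    matrix = [4, 4, 4, 4]
    for name, mix in (("mixed", mixed), ("matrix", matrix)):
        print(name, simd_utilization(mix), simd_cycles(mix), scalar_cycles(mix))
```

Under these assumptions both designs tie on pure vec4 code, and the 4 scalar cores finish the mixed shader in half the cycles with the same number of FPUs. The model says nothing about the extra fetch/decode area that 4 scalar cores would need, so it is not a full cost comparison.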

If your shader code is full of scalar ops, your vector CPU will be packing and unpacking the data all the time. Otherwise, 3 of the FPUs will be idle most of the time. If you use 4 scalar cores, the FPUs will be used on every clock cycle.

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
