>
> We don't.  However, we could use software (the compiler/assembler) to do
> it.

So you need pack/unpack instructions, and you must balance their use,
because each one also takes a cycle. Or you need individually
selectable words inside the SIMD register, which costs a big switch.

>> A vertex/fragment shader must be applied to each vertex/fragment
>> individually. So you could duplicate it as many times as you want.
>> The problem is very different from a CPU. CPUs are mainly designed
>> to speed up the processing of a single thread.
>>
>> Here, you want to optimise the performance/cost ratio. In hardware,
>> cost == area. A shader is mainly MAC operations, and a MAC unit is a
>> big unit.
>
> Yes.  Since much of the Shader math is going to be 4x4 matrix
> multiplies, that is what needs to be optimized.

No, it's not. Read the code. There are a lot of dot products (a vector
operation, not a matrix one) and a lot of scalar MULs.
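
To make the distinction concrete, here is a minimal sketch in Python
(chosen only for illustration; the names dot and mat_vec are
hypothetical and not taken from the shader code). A dot product is a
chain of MACs that ends in one scalar, while a matrix-vector multiply
is four such chains:

```python
def dot(a, b):
    """Dot product as a chain of multiply-accumulate (MAC) steps."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y  # one MAC per component
    return acc  # a single scalar result


def mat_vec(m, v):
    """4x4 matrix times 4-vector: one dot product per matrix row."""
    return [dot(row, v) for row in m]


identity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))
print(mat_vec(identity, [1.0, 2.0, 3.0, 4.0]))
```

A lone dot product produces one value from four MACs, so on 4-wide
hardware the lanes have to be summed together at the end instead of
each delivering an independent output.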

>> So you must have the highest activity possible on each FPU. In the
>> best case, useful values must be produced by each FPU on every
>> clock cycle.
>
> Yes, that is what will happen when doing 4x4 matrix multiplies.  With 4
> pipelined Multiply Accumulate units, you will get 4 32 bit float outputs
> every 4 machine cycles if the pipeline is kept full.  This is the best
> you can do without more hardware.
>

You will not have only vector MUL instructions in a program.

>> With scalar code, this activity could be very high. With SIMD
>> hardware and scalar code, the packing needs a lot of effort. And I
>> don't think you could reach the activity of the scalar version.
>
> If the compiler does the packing, you will get the maximum possible.
> However I doubt that there will be enough to fill all 4 channels except
> when doing a 4x4 matrix multiply.  But, you don't want to slow things
> down just to have higher ALU utilization.  Yes, some of the hardware
> will be idle when the operations are other than 4x4 matrix multiplies,
> but I don't see that this is a problem.

It's a problem if you could use the 4 FPUs in a different way that
would give you more performance.

If your shader code is full of scalar ops, your vector CPU will be
packing/unpacking the data all the time. Or, most of the time, 3 FPUs
will be idle.

If you use 4 scalar cores, the FPUs will be used on every clock cycle.
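
A rough back-of-the-envelope model of this argument, as a Python
sketch. Everything in it is an assumption for illustration: the
instruction mix is made up, each instruction is taken to issue in one
cycle with full pipelines, pack/unpack costs are ignored (which favours
the SIMD design), and there are always enough independent
vertices/fragments to keep 4 scalar cores busy.

```python
LANES = 4  # number of FPUs in both designs


def simd_cycles_per_item(op_widths):
    """One instruction issues per cycle, however many lanes it uses."""
    return len(op_widths)


def simd_utilization(op_widths, lanes=LANES):
    """Fraction of FPU slots doing useful work on the 4-wide SIMD unit."""
    return sum(op_widths) / (len(op_widths) * lanes)


def scalar_cycles_per_item(op_widths, cores=LANES):
    """Each core spends one cycle per component; `cores` items run at once."""
    return sum(op_widths) / cores


# Hypothetical shader fragment: width of each op in components
# (4 = full vector op, 3 = xyz op, 1 = scalar op).
mix = [4, 1, 1, 3, 1, 1]

print("SIMD cycles per item:  ", simd_cycles_per_item(mix))
print("SIMD FPU utilization:  ", round(simd_utilization(mix), 3))
print("Scalar cycles per item:", scalar_cycles_per_item(mix))
```

Under these assumptions the two designs tie only when every op is a
full 4-wide vector op; any scalar or 3-wide op leaves SIMD lanes idle,
while the scalar cores keep their FPUs busy.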


