> Just like with Niagara, we have lots of opportunities to avoid control
> and data hazards, so we don't need to account for them.  (We may want
> to have some locks in place, but we can afford to just stall.)  For
> each pixel, even the effective memory read latency is smaller.

A "normal" CPU is designed to handle interrupts and context switches,
which is a huge constraint on a lot of possible optimisations. A GPU
looks more like a DSP with specific hardware.

From the ARB documentation, we should provide these primitives:

      Instruction    Inputs  Output   Description
      -----------    ------  ------   --------------------------------
      ABS            v       v        absolute value
      ADD            v,v     v        add
      ARL            s       a        address register load
      DP3            v,v     ssss     3-component dot product
      DP4            v,v     ssss     4-component dot product
      DPH            v,v     ssss     homogeneous dot product
      DST            v,v     v        distance vector
      EX2            s       ssss     exponential base 2
      EXP            s       v        exponential base 2 (approximate)
      FLR            v       v        floor
      FRC            v       v        fraction
      LG2            s       ssss     logarithm base 2
      LIT            v       v        compute light coefficients
      LOG            s       v        logarithm base 2 (approximate)
      MAD            v,v,v   v        multiply and add
      MAX            v,v     v        maximum
      MIN            v,v     v        minimum
      MOV            v       v        move
      MUL            v,v     v        multiply
      POW            s,s     ssss     exponentiate
      RCP            s       ssss     reciprocal
      RSQ            s       ssss     reciprocal square root
      SGE            v,v     v        set on greater than or equal
      SLT            v,v     v        set on less than
      SUB            v,v     v        subtract
      SWZ            v       v        extended swizzle
      XPD            v,v     v        cross product

That's mainly FP multiplication, so the design should keep the FMUL
busy on every cycle. Alternatively, we could choose a smaller 2-cycle
FMUL and use more cores (the compiled code shows a lot of MOV
instructions along the way).
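
To make the "mainly FP multiplication" point concrete, here is a rough
sketch in C of how a few of the primitives above break down into scalar
multiplies. The vec4 type, the fmul() helper and the fmul_count counter
are names I made up for illustration; they are not from the ARB
documentation, and the adds are left uncounted on purpose.

```c
/* Hypothetical 4-component register; layout is an assumption. */
typedef struct { float x, y, z, w; } vec4;

/* Counts scalar multiplies issued, to show how FMUL-heavy the ops are. */
static int fmul_count = 0;

static float fmul(float a, float b) { fmul_count++; return a * b; }

/* Replicate a scalar result to all four components ("ssss" output). */
static vec4 splat(float s) { vec4 r = { s, s, s, s }; return r; }

/* DP3: 3-component dot product, 3 multiplies. */
static vec4 op_dp3(vec4 a, vec4 b) {
    return splat(fmul(a.x, b.x) + fmul(a.y, b.y) + fmul(a.z, b.z));
}

/* DP4: 4-component dot product, 4 multiplies. */
static vec4 op_dp4(vec4 a, vec4 b) {
    return splat(fmul(a.x, b.x) + fmul(a.y, b.y) +
                 fmul(a.z, b.z) + fmul(a.w, b.w));
}

/* DPH: homogeneous dot product, DP3 plus the second operand's w. */
static vec4 op_dph(vec4 a, vec4 b) {
    return splat(fmul(a.x, b.x) + fmul(a.y, b.y) + fmul(a.z, b.z) + b.w);
}

/* MAD: component-wise a * b + c, 4 multiplies. */
static vec4 op_mad(vec4 a, vec4 b, vec4 c) {
    vec4 r = { fmul(a.x, b.x) + c.x, fmul(a.y, b.y) + c.y,
               fmul(a.z, b.z) + c.z, fmul(a.w, b.w) + c.w };
    return r;
}

/* XPD: cross product, 6 multiplies; w is set to 0 here as a placeholder. */
static vec4 op_xpd(vec4 a, vec4 b) {
    vec4 r = { fmul(a.y, b.z) - fmul(a.z, b.y),
               fmul(a.z, b.x) - fmul(a.x, b.z),
               fmul(a.x, b.y) - fmul(a.y, b.x),
               0.0f };
    return r;
}
```

With a single scalar FMUL, a DP4 or MAD would occupy it for four issue
slots and an XPD for six, which is why keeping the multiplier fed (or
trading a slower multiplier for more cores) matters so much here.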

