[EMAIL PROTECTED] wrote:
Just like with Niagara, we have lots of opportunities to avoid control
and data hazards, so we don't need to account for them.  (We may want
to have some locks in place, but we can afford to just stall.)  For
each pixel, even the effective memory read latency is smaller.
    

A "normal" CPU is designed to handle interrupts and context switches,
which is a huge constraint on a lot of possible optimisations. A GPU
looks more like a DSP with specific hardware.

From the ARB documentation, we should provide these primitives:

      ABS            v       v        absolute value
      ADD            v,v     v        add
      ARL            s       a        address register load
      DP3            v,v     ssss     3-component dot product
      DP4            v,v     ssss     4-component dot product
      DPH            v,v     ssss     homogeneous dot product
      DST            v,v     v        distance vector
      EX2            s       ssss     exponential base 2
      EXP            s       v        exponential base 2 (approximate)
      FLR            v       v        floor
      FRC            v       v        fraction
      LG2            s       ssss     logarithm base 2
      LIT            v       v        compute light coefficients
      LOG            s       v        logarithm base 2 (approximate)
      MAD            v,v,v   v        multiply and add
      MAX            v,v     v        maximum
      MIN            v,v     v        minimum
      MOV            v       v        move
      MUL            v,v     v        multiply
      POW            s,s     ssss     exponentiate
      RCP            s       ssss     reciprocal
      RSQ            s       ssss     reciprocal square root
      SGE            v,v     v        set on greater than or equal
      SLT            v,v     v        set on less than
      SUB            v,v     v        subtract
      SWZ            v       v        extended swizzle
      XPD            v,v     v        cross product

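To see why this mix favours multiply-accumulate hardware, here is a small C sketch (helper names are my own invention, not from the ARB spec) emulating the DP4 primitive above. A 4x4 matrix-vector transform, the classic vertex-program inner loop, is four DP4s, i.e. 16 FP multiplies:

```c
#include <assert.h>

/* Software model of DP4: 4-component dot product.
 * Each DP4 costs 4 multiplies + 3 adds, so the FP
 * multiplier is the dominant resource. */
static float dp4(const float a[4], const float b[4])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

/* Vertex transform out = M * v, one DP4 per matrix row. */
static void xform(float out[4], const float m[4][4], const float v[4])
{
    for (int r = 0; r < 4; ++r)
        out[r] = dp4(m[r], v);
}
```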
That's mainly FP multiplication, so the design should keep the FMUL
busy every cycle. Alternatively, we could choose a smaller 2-cycle
FMUL and use more cores (the compiled code shows a lot of MOV
instructions that could fill the gaps in the meantime).
From an architecture standpoint, why not go for 5 execution units in one block? Of those 5 units, one is dedicated to memory management (load/store, register and data movement); the other four act either as a single execution unit for vector operations or as 4 distinct units for scalar operations. Each unit could implement only a subset of the scalar operations; they don't all need to support the same ones, which also helps reduce the overall size of each execution block.
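A hedged sketch of what one such instruction line could look like (field names, widths, and the opcode list are invented for illustration): one memory/move slot plus four scalar ALU slots, all issued together:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 5-slot instruction line for the block described
 * above: slot `mem` handles load/store/register movement, the
 * four `alu` slots are scalar units (each may implement only a
 * subset of these opcodes). */
enum slot_op { OP_NOP, OP_MOV, OP_LOAD, OP_STORE,
               OP_MUL, OP_ADD, OP_MAD, OP_RCP };

struct slot  { enum slot_op op; uint8_t dst, src0, src1; };
struct iline { struct slot mem; struct slot alu[4]; };

/* Count how many of the five slots do useful work this cycle --
 * a simple measure of how well the compiler filled the line. */
static int slots_used(const struct iline *i)
{
    int n = (i->mem.op != OP_NOP);
    for (int k = 0; k < 4; ++k)
        n += (i->alu[k].op != OP_NOP);
    return n;
}
```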

With such an architecture, and since the code to run is rather small, it should be possible to reorder the operations so that most of them execute in parallel. And since all the units work at the same time, we just need a rather wide instruction memory on chip; it doesn't need to be deep, since first-generation shader programs couldn't exceed 255 instructions for a basic program. So one instruction line feeds 5 operations at a time. It looks a little bit like a DSP architecture.

After that, you could replicate the meta-block many times depending on the performance you want. But with more than one block you will need some kind of dispatcher (in hardware, or in software via the driver) to divide the work. Since it is a small processor, if you don't have too much dependency between instructions and you know the cycle count for each, you could have multiple instructions executing at the same time by pipelining the operations.
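A minimal sketch of the software side of such a dispatcher (all names are assumptions of mine): statically splitting a run of pixels into contiguous ranges, one per identical meta-block, with the remainder going to the lowest-numbered blocks:

```c
#include <assert.h>

/* Hypothetical static dispatch: divide `npixels` of work across
 * `nblocks` identical meta-blocks.  Each block gets a contiguous
 * [*first, *first + *count) range; remainder pixels go to the
 * lowest-numbered blocks so the ranges cover everything exactly once. */
static void dispatch_range(int npixels, int nblocks, int block,
                           int *first, int *count)
{
    int base = npixels / nblocks;
    int rem  = npixels % nblocks;
    *count = base + (block < rem);
    *first = block * base + (block < rem ? block : rem);
}
```

A real driver would likely dispatch dynamically to balance load, but the static split shows the shape of the problem.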
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
