> Just like with Niagara, we have lots of opportunities to avoid control
> and data hazards, so we don't need to account for them. (We may want
> to have some locks in place, but we can afford to just stall.) For
> each pixel, even the effective memory read latency is smaller.
A "normal" CPU is designed to handle interrupts and context switches; that's a
huge constraint on a lot of possible optimisations. A GPU looks more like a DSP
with specific hardware.
From the ARB documentation, we should provide these primitives:

Instruction  Inputs  Output  Description
-----------  ------  ------  --------------------------------
ABS          v       v       absolute value
ADD          v,v     v       add
ARL          s       a       address register load
DP3          v,v     ssss    3-component dot product
DP4          v,v     ssss    4-component dot product
DPH          v,v     ssss    homogeneous dot product
DST          v,v     v       distance vector
EX2          s       ssss    exponential base 2
EXP          s       v       exponential base 2 (approximate)
FLR          v       v       floor
FRC          v       v       fraction
LG2          s       ssss    logarithm base 2
LIT          v       v       compute light coefficients
LOG          s       v       logarithm base 2 (approximate)
MAD          v,v,v   v       multiply and add
MAX          v,v     v       maximum
MIN          v,v     v       minimum
MOV          v       v       move
MUL          v,v     v       multiply
POW          s,s     ssss    exponentiate
RCP          s       ssss    reciprocal
RSQ          s       ssss    reciprocal square root
SGE          v,v     v       set on greater than or equal
SLT          v,v     v       set on less than
SUB          v,v     v       subtract
SWZ          v       v       extended swizzle
XPD          v,v     v       cross product

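
To make the list above a bit more concrete, here is a rough Python sketch of
what a few of these opcodes compute. It is only an illustration, not a
reference implementation: vectors are plain 4-tuples (x, y, z, w), the
function names are mine, and the 0.0 written to the w component of XPD is an
arbitrary choice (that component is left undefined by the instruction).

```python
# Sketch of the arithmetic behind a few ARB vertex program opcodes.
# Vectors are 4-tuples (x, y, z, w) of floats; names are illustrative only.

def dp3(a, b):
    # 3 multiplies; the scalar result is replicated to all components (ssss).
    d = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return (d, d, d, d)

def dp4(a, b):
    # 4 multiplies.
    d = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]
    return (d, d, d, d)

def dph(a, b):
    # Homogeneous dot product: 3 multiplies, then add b.w unchanged.
    d = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + b[3]
    return (d, d, d, d)

def mad(a, b, c):
    # Component-wise a * b + c: 4 multiplies.
    return tuple(x * y + z for x, y, z in zip(a, b, c))

def xpd(a, b):
    # Cross product of the xyz parts: 6 multiplies.
    # Assumption: w is set to 0.0 here because it is undefined.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0],
            0.0)

def slt(a, b):
    # Component-wise "set on less than": 1.0 where a < b, else 0.0.
    return tuple(1.0 if x < y else 0.0 for x, y in zip(a, b))
```

Even in this small subset, every arithmetic opcode except SLT boils down to
several multiplies per instruction.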
That's mainly FP multiplication, so the design should keep the FMUL busy on
every cycle. Alternatively, we could choose a smaller FMUL that takes 2 cycles
and use more cores (the compiled code shows a lot of MOV instructions in
between the multiplies).
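
A back-of-the-envelope model of that trade-off, as a Python sketch. Everything
in it is an assumption made for illustration: the FMUL is unpipelined and
blocks issue for its full latency, every other instruction takes 1 cycle, two
cores with a half-size FMUL cost about the same area as one core with a
full-size FMUL, and the instruction mix (6 multiplies, 4 MOVs) is made up.

```python
from fractions import Fraction

def cycles_per_program(n_mul, n_other, fmul_latency):
    """Cycles for one core to run the program once.

    Assumes an unpipelined FMUL that occupies the core for fmul_latency
    cycles, and 1 cycle for every other instruction (e.g. MOV).
    """
    return n_mul * fmul_latency + n_other

def throughput(n_cores, n_mul, n_other, fmul_latency):
    """Programs completed per cycle across n_cores independent cores."""
    return Fraction(n_cores, cycles_per_program(n_mul, n_other, fmul_latency))

# Hypothetical shader: 6 multiply-type instructions and 4 MOVs.
one_big = throughput(1, n_mul=6, n_other=4, fmul_latency=1)
two_small = throughput(2, n_mul=6, n_other=4, fmul_latency=2)
```

Under these assumptions the two small cores come out ahead whenever the
program contains any non-multiply instruction, and they tie when the program
is pure multiplication. The area assumption is the weak point; if the FMUL is
only a small part of a core, the comparison changes.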
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)