Ian Romanick wrote (in the GL_ARB_texture_env_crossbar on r200 thread):
> Other optimizations are possible, but I never explored them. Most of
> the ones that I could think of are probably unlikely in practice.
> Doing things like replacing {T1 + T2}, {P + P}, {P + T3} with {T1 +
> T2}*2, {P + T3}, or replacing {T1 * T2}, {P + T0} with {T1 * T2 + T0}
> are possible, but probably not worth the effort.
I thought about whether it would really be worthwhile to optimize for
shorter "shader programs", and figured I'd need some performance figures
up front (there's no point implementing some super-advanced optimizer if
it turns out the shader won't run faster anyway...)
So, I hacked the driver a bit and gathered some numbers.
What I did was mostly to always enable (and emit) a fixed number of the
pixel shader stages (e.g. by changing the R200_TEX_BLEND_x_ENABLE bits).
Quake III performance, with lightmaps (so most things are dual-textured,
some remain single-textured), hyperz, compressed textures, 1024x768x32,
demo four:
("enable 4 tex env", for instance, means pixel shader stages 0-3 are
always enabled. With 2 or more the rendering output is always correct;
with 1, light maps are missing and everything is too bright; with 0
there are a couple more errors.)
normal code: 149 fps
enable 0 tex env: 150 fps
enable 1 tex env: 150 fps
enable 2 tex env: 146 fps
enable 3 tex env: 122 fps
enable 4 tex env: 97 fps
enable 5 tex env: 79 fps
enable 6 tex env: 66 fps
And, to see if it makes a difference WHICH stages are enabled, with only
tex blend stage 5 enabled:
texenv 5 only: 67 fps
The same with vertex lighting (so everything is single-textured):
normal code: 227 fps
0 tex env: 227 fps
1 tex env: 227 fps
2 tex env: 211 fps
3 tex env: 167 fps
4 tex env: 127 fps
5 tex env: 103 fps
6 tex env: 87 fps
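Since fps is not linear in the work done per frame, it can help to look at the lightmap figures above as frame times instead. The following is only a small sketch of that arithmetic; the function names are made up for illustration, and the input is just the lightmap list quoted above.

```python
# Quake III (lightmap) results quoted above: enabled tex env stages -> fps.
fps_by_stages = {0: 150, 1: 150, 2: 146, 3: 122, 4: 97, 5: 79, 6: 66}


def frame_time_ms(fps):
    """Average time per frame in milliseconds."""
    return 1000.0 / fps


def per_stage_cost_ms(results):
    """Extra frame time added by each step up in enabled stage count."""
    stages = sorted(results)
    return {
        n: frame_time_ms(results[n]) - frame_time_ms(results[prev])
        for prev, n in zip(stages, stages[1:])
    }


if __name__ == "__main__":
    for n, extra in per_stage_cost_ms(fps_by_stages).items():
        total = frame_time_ms(fps_by_stages[n])
        print(f"{n} tex env: {total:.2f} ms/frame (+{extra:.2f} ms)")
```

With these inputs the first two stages come out nearly free, and each stage from the third on adds somewhere between about 1.3 and 2.5 ms per frame.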
The numbers were interesting but not quite conclusive enough, so a
couple more with the Mesa multiarb demo (textures always enabled
in-order in the demo):
multiarb, fps by number of textures used:
             2 textures   3 textures   4 textures
0 tex env:   230          209          191
2 tex env:   227          210          191
3 tex env:   210          200          191
4 tex env:   191          191          187
6 tex env:   162          162          162
And finally some tests to see if it makes a difference which texture
sampling stages (as opposed to the blending stages) are enabled, using a
modified multiarb (marked (*) below: GL_REPLACE, and the same texture
for the 1st and 4th texture). The "hacked" result means the demo used
the 4th texture mapping unit, but the driver was hacked to use the 1st
blending stage instead of the 4th.
multiarb (*) 1st tex:
1 tex env: 255 fps
4 tex env: 191 fps
normal: 257 fps
multiarb (*) 4th tex:
1 tex env: 255 fps
4 tex env: 191 fps
normal: 191 fps
hacked: 257 fps
So, conclusions: long shader programs can indeed have a (sometimes
drastic, see the quake3 results) performance impact. However, it looks
like you get about as many instructions basically for free as you use
texturing units. Since with standard GL you have as many texture
blending stages as you use texture mapping units, optimizing doesn't
really seem to be worthwhile (unless you actually don't need some
texture lookups for the final result and could disable texture sampling
for that unit). Also, the results MAY be different for an r200 (instead
of the rv250 I used), since it can sample 2 textures per clock as
opposed to 1 (though if I'm not mistaken that is restricted to bilinear,
otherwise it needs 2 cycles, whereas rv250 can do trilinear in one
cycle), but has the same arithmetic throughput as the rv250, meaning it
might not hide the arithmetic instructions so well.
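The "about as many instructions for free as texturing units" reading can be checked against the multiarb numbers with a few lines of code. This is only a rough sketch: the 5% tolerance is an arbitrary assumption of mine, the function name is made up, and the answer is limited to the stage counts that were actually measured (5 stages was not).

```python
# multiarb results quoted above: textures used -> {enabled tex env stages: fps}
multiarb_fps = {
    2: {0: 230, 2: 227, 3: 210, 4: 191, 6: 162},
    3: {0: 209, 2: 210, 3: 200, 4: 191, 6: 162},
    4: {0: 191, 2: 191, 3: 191, 4: 187, 6: 162},
}


def free_stages(results, tolerance=0.05):
    """Largest measured stage count whose fps stays within `tolerance`
    of the 0-stage baseline (the 5% default is an arbitrary assumption)."""
    baseline = results[0]
    free = 0
    for stages in sorted(results):
        if results[stages] < baseline * (1.0 - tolerance):
            break
        free = stages
    return free


if __name__ == "__main__":
    for textures, results in multiarb_fps.items():
        print(f"{textures} textures: {free_stages(results)} stages within tolerance")
```

A tighter tolerance shifts the 3-texture case down by one stage, so this supports "about as many" rather than an exact rule.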
An optimization which would have benefits, however, would be to always
use pixel shader stages in-order: the time the chip needs to perform the
calculations does not seem to depend on the number of stages enabled at
all, but only on the highest stage enabled. (This is clearly not the
case for texture sampling, where it doesn't seem to matter which units
are used. Again, with r200 the results may be different, since its 2
texturing units seem to be somewhat pair-wise arranged.) I am not too
sure, though, that code which uses the stages out of order is common (I
believe most apps use the texturing units in-order).
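The in-order idea amounts to packing whatever blend stages are enabled into the lowest hardware slots. A minimal sketch of that remapping, with hypothetical helper names and no actual register programming, might look like this (a real driver would also have to rewrite each stage's inputs so that "previous stage result" and texture references still point at the right sources):

```python
def compact_stages(enabled_stages):
    """Map each enabled blend stage to the lowest free hardware slot,
    preserving the original order. Returns {original_stage: slot}."""
    return {stage: slot for slot, stage in enumerate(sorted(enabled_stages))}


def highest_slot(mapping):
    """Highest hardware slot in use, which is what the timings above
    suggest determines the cost; -1 if nothing is enabled."""
    return max(mapping.values(), default=-1)


if __name__ == "__main__":
    # The "texenv 5 only" case: stage 5 is moved into slot 0.
    mapping = compact_stages({5})
    print(mapping, "highest slot:", highest_slot(mapping))
```

For the "texenv 5 only" test this moves the single enabled stage from slot 5 to slot 0, which is the same thing the "hacked" multiarb run did by hand.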
On a somewhat unrelated note, I was surprised to see a much larger
performance difference in quake3 than in multiarb (as multiarb does
basically nothing but texturing, while quake3 also uses the z-buffer
etc.).
Roland