Ian Romanick wrote (in the GL_ARB_texture_env_crossbar on r200 thread):
> Other optimizations are possible, but I never explored them.  Most of
>  the ones that I could think of are probably unlikely in practice.
> Doing things like replacing {T1 + T2}, {P + P}, {P + T3} with {T1 +
> T2}*2, {P + T3}, or replacing {T1 * T2}, {P + T0} with {T1 * T2 + T0}
> are possible, but probably not worth the effort.

I wondered whether it would really be worthwhile to optimize for shorter "shader programs", and figured I'd need some performance figures up front (there's no point implementing some super-advanced optimizer if it turns out the shader won't run any faster anyway...).

So, I hacked the driver a bit and gathered some numbers.
What I did was mostly to always enable (and emit) a fixed number of the pixel shader stages (e.g. by changing the R200_TEX_BLEND_x_ENABLE bits). "enable 4 tex env", for instance, means pixel shader stages 0-3 are always enabled. With 2 or more stages the rendering output is always correct, with 1 the lightmaps are missing and everything is too bright, and with 0 there are a couple more errors.

Quake III performance, with lightmaps (so most things are dual-textured, some remain single-textured), hyperz, compressed textures, 1024x768x32, demo four:
normal code: 149 fps
enable 0 tex env: 150 fps
enable 1 tex env: 150 fps
enable 2 tex env: 146 fps
enable 3 tex env: 122 fps
enable 4 tex env: 97 fps
enable 5 tex env: 79 fps
enable 6 tex env: 66 fps
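
To make the cost per stage easier to see, the frame rates can be converted to frame times. The following is only a small arithmetic sketch; the helper names `frame_time_ms` and `added_cost_ms` are made up for illustration and are not part of the driver.

```c
/* Convert a frame rate into the time spent on one frame, in milliseconds. */
static double frame_time_ms(double fps)
{
    return 1000.0 / fps;
}

/* Extra time per frame when the frame rate drops from fps_before
 * to fps_after (e.g. after force-enabling one more blend stage). */
static double added_cost_ms(double fps_before, double fps_after)
{
    return frame_time_ms(fps_after) - frame_time_ms(fps_before);
}
```

Applied to the figures above, going from 3 to 4, 4 to 5 and 5 to 6 stages adds between about 2.1 and 2.5 ms per frame each time, i.e. beyond the second stage every extra stage costs roughly the same amount of time in this test.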

And, to see whether it makes a difference WHICH stages are enabled, with only tex blend stage 5 enabled:
texenv 5 only: 67 fps

The same with vertex lighting (so everything is single-textured):
normal code: 227 fps
0 tex env: 227 fps
1 tex env: 227 fps
2 tex env: 211 fps
3 tex env: 167 fps
4 tex env: 127 fps
5 tex env: 103 fps
6 tex env: 87 fps

The numbers were interesting but not quite conclusive enough, so here are a couple more from the Mesa multiarb demo (the demo always enables textures in order):
multiarb 2 textures:
0 tex env: 230 fps
2 tex env: 227 fps
3 tex env: 210 fps
4 tex env: 191 fps
6 tex env: 162 fps

multiarb 3 textures:
0 tex env: 209 fps
2 tex env: 210 fps
3 tex env: 200 fps
4 tex env: 191 fps
6 tex env: 162 fps

multiarb 4 textures:
0 tex env: 191 fps
2 tex env: 191 fps
3 tex env: 191 fps
4 tex env: 187 fps
6 tex env: 162 fps

And finally, some tests to see whether it makes a difference which texture sampling stages (as opposed to the blending stages) are enabled, using a modified multiarb, marked (*) below (with GL_REPLACE, and the same texture for the 1st and 4th texture). The "hacked" result means the 4th texture mapping unit was used, but the driver was hacked to use the 1st blending stage instead of the 4th.
multiarb (*) 1st tex:
1 tex env: 255 fps
4 tex env: 191 fps
normal: 257 fps

multiarb (*) 4th tex:
1 tex env: 255 fps
4 tex env: 191 fps
normal: 191 fps
hacked: 257 fps

So, conclusions: long shader programs can indeed have a (sometimes drastic - see the quake3 results) performance impact. However, it looks like you get about as many instructions for free as you use texturing units. Since with standard GL you have as many texture blending stages as you use texture mapping units, optimizing doesn't really seem to be worthwhile (unless you don't actually need some texture lookups for the final result and could disable texture sampling for that unit).

The results MAY be different for an r200 (instead of the rv250 I used), since it can sample 2 textures per clock as opposed to 1 (though if I'm not mistaken that is restricted to bilinear filtering, otherwise it needs 2 cycles, whereas the rv250 can do trilinear in one cycle), but has the same arithmetic throughput as the rv250, meaning it might not hide the arithmetic instructions as well.

One optimization which would have benefits, however, is to always use the pixel shader stages in order: the time the chip needs to perform the calculations does not seem to depend on the number of stages enabled at all, but only on the highest stage enabled. (This is clearly not the case for texture sampling, where it doesn't seem to matter which units are used. Again, with r200 the results may be different, since its 2 texturing units seem to be arranged somewhat pair-wise.) I am not too sure, though, that code using the stages out of order is common (I believe most apps use the texturing units in order).
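
The in-order idea could be sketched roughly as follows. This is only an illustration under the assumption that blend stages can be freely moved to lower slots; `struct blend_stage`, `stages_paid_for` and `compact_stages` are made-up names, not the r200 driver's actual data structures, and a real driver would also have to fix up anything that refers to a stage by number (such as crossbar sources).

```c
#define MAX_BLEND_STAGES 6 /* number of stages measured above */

/* Hypothetical, simplified description of one blend stage. */
struct blend_stage {
    int enabled;     /* nonzero if the stage does any work */
    int source_unit; /* texture unit the stage reads from */
};

/* Number of stages the chip appears to pay for:
 * index of the highest enabled stage, plus one. */
static int stages_paid_for(const struct blend_stage *stages, int count)
{
    int paid = 0;
    for (int i = 0; i < count; i++) {
        if (stages[i].enabled)
            paid = i + 1;
    }
    return paid;
}

/* Pack the enabled stages into the lowest slots, keeping their
 * relative order. Returns the number of enabled stages. */
static int compact_stages(struct blend_stage *stages, int count)
{
    int out = 0;
    for (int in = 0; in < count; in++) {
        if (!stages[in].enabled)
            continue;
        if (in != out) {
            stages[out] = stages[in];
            stages[in].enabled = 0;
        }
        out++;
    }
    return out;
}
```

With only stage 5 enabled (the 67 fps case), `stages_paid_for()` returns 6 before compaction and 1 afterwards, while the stage keeps reading from the same texture unit. That is the same situation as the "hacked" multiarb run, where the 4th texture unit was combined with the 1st blending stage.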

On a somewhat unrelated note, I was surprised to see a much larger performance difference in quake3 than in multiarb (multiarb does basically nothing but texturing, whereas quake3 also uses the z-buffer etc.).

Roland


--
_______________________________________________
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel
