Ian Romanick wrote:
I'm actually wondering how ATI solved that problem in their driver;
I couldn't see an easy way to avoid the fallback - even using the
2 additional tex env stages or the second phase of the fragment
pipeline isn't going to fix the issue, I think. Maybe someone else
has a good idea?


So, for any set of texture environments there is an ordering of
operations and an assignment of registers that will work. Once upon
a time I wrote a python script that implemented a simple algorithm
to do this.  I'll have to see if I can dig it up.

The algorithm works in two passes. The first pass identifies any
texture stages and texture reads, if any, that do not contribute to
the final result. I'm going to use the notation T# for a texture
read and P for the previous result. If the texture environment is
{ {T0 + T1} {T3 - T2} }, then T0, T1, and the result of adding them
don't contribute to the final result. You can omit those stages
entirely and freely use those registers as temporaries.
Yes, that's what my code does too: it uses the regs which contain
unneeded textures as temporaries (it does not, however, eliminate
the texture lookups or the env stages; I didn't want to mess with
that state for now, but it could be done).

The second pass assigns registers. Each T# gets assigned the next
R#, in order. If T0, T1, and T4 contribute to the final result, they
get assigned R0, R1, and R2. Next, each P gets assigned an available
register. A register is available if it's either unassigned or its
value will not be read again. At any point there is *always* an
available register. I think this is mathematically provable, but
it's way beyond my patience to do so. :)
Ah, you also reorder the texture assignments. I didn't look at that
one; it seemed like too much work (and it shouldn't make a
difference).

Here are a couple of examples. I have left out the operations for
clarity. I'm also going to simplify a bit: I assume 3 textures,
3 registers, 3 stages, and 2 reads per stage.

Start:    {T0, T2}, {P , T0}, {T1, P }
Pass 1:   {T0, T2}, {P , T0}, {T1, P }
Pass 2.1: {R0, R2}, {P , R0}, {R1, P }
Pass 2.2: {R0, R2}, {R2, R0}, {R1, R0}

Start:    {T0, T2}, {T1, T0}, {T1, P }
Pass 1:   {T1, T0}, {T1, P }
Pass 2.1: {R1, R0}, {R1, P }
Pass 2.2: {R1, R0}, {R1, R0}
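Roughly, the two passes could be sketched like this in Python (this is a
reconstruction from the description above, not the actual script; the
stage representation and the fixed 3-register model from the simplified
examples are assumptions):

```python
# Reconstruction of the two-pass scheme described above.  A stage is a
# list of argument names: "T<n>" is a texture read, "P" is the previous
# stage's result.  Assumes the simplified model from the examples:
# 3 registers, fewer reads per stage than registers.

REGS = ("R0", "R1", "R2")

def allocate(stages):
    # Pass 1: drop stages whose result never reaches the final stage.
    # The last stage is always live; an earlier stage is live only if
    # its successor is live *and* actually reads P.
    live = [True] * len(stages)
    for i in range(len(stages) - 2, -1, -1):
        live[i] = live[i + 1] and "P" in stages[i + 1]
    stages = [s for i, s in enumerate(stages) if live[i]]

    # Pass 2.1: surviving textures get registers in index order,
    # e.g. live T0, T1, T4 become R0, R1, R2.
    texs = sorted({a for s in stages for a in s if a != "P"},
                  key=lambda t: int(t[1:]))
    name = {t: REGS[i] for i, t in enumerate(texs)}
    stages = [[name.get(a, a) for a in s] for s in stages]

    # Pass 2.2: each P takes a register that is either unassigned or
    # whose value is never read again, i.e. a register not appearing
    # in the remaining (unresolved) stages.
    for i in range(1, len(stages)):
        if "P" in stages[i]:
            still_read = {a for s in stages[i:] for a in s}
            dest = next(r for r in REGS if r not in still_read)
            stages[i] = [dest if a == "P" else a for a in stages[i]]
    return stages
```

Running it on the two examples above reproduces the Pass 2.2 results
shown there.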

Working through this, I noticed something that I hadn't noticed
before: this technique only works if each operation cannot access
the entire register set. I first did it with 3 reads per stage, and
I very quickly came up with some impossible examples. :) 3 reads
with 6 registers will still work.
But you can have 6 reads per stage (thanks to different alpha/rgb
sources), i.e. the whole register set.
And here's a counterexample to your theory, even assuming only 3
reads per stage :-):
{T4, T5, P} {T2, T3, P} {T0, T1, P} {T4, T5, P} {T2, T3, P} {T0, T1, P}
How do you want to optimize that? In the first two stages you can't
assign any reg, as all 6 texture sampling results are needed again
(that is, unless you analyze the whole "fragment program" and make a
shorter, mathematically equivalent one - but with all the different
operations possible plus scaling etc. this may not be possible).

The nice thing about this algorithm is that it not only works, but
it eliminates "dead code" and unused textures. I don't know about
the former, but the latter can certainly improve the performance of
ill-written code. In addition, this same algorithm could be used to
optimize ATI_fragment_program code. It should also make it possible
to implement NV_texture_env_combine4, which is used by a lot more
programs than ATI_texture_env_combine3. In both these cases you need
to expand the notation to have multiple P values.
I thought about those unused textures too; is it worth bothering to
do performance optimizations for crappy apps? Is such code even in
widespread use?

Other optimizations are possible, but I never explored them. Most of
the ones that I could think of are probably unlikely in practice.
Doing things like replacing {T1 + T2}, {P + P}, {P + T3} with
{T1 + T2}*2, {P + T3}, or replacing {T1 * T2}, {P + T0} with
{T1 * T2 + T0}, are possible, but probably not worth the effort.
That gets close to the complexity of optimizing compilers, which is
not my strength :-). But you're probably right: the env stages are
likely executed faster than the texture lookups (though I have no
idea exactly how fast, something like 1 clock per stage?). In
contrast to optimizing away unused textures, though, there should be
more opportunity for such optimizations.
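For what it's worth, the {T1 * T2}, {P + T0} -> {T1 * T2 + T0} merge
could be sketched as a simple peephole pass (purely hypothetical: the
op names and the `fold_mad` helper are my own, and a real pass would
also have to check that scale and clamp state on the MUL stage permit
the fusion):

```python
# Hypothetical peephole pass: fuse a MUL stage whose result is only
# consumed (via P) by the following ADD stage into one multiply-add.
# In a texture env chain, P only ever feeds the next stage, so the
# MUL result cannot be read anywhere else.

def fold_mad(stages):
    """stages: list of (op, args) pairs, e.g. ("MUL", ["T1", "T2"])."""
    out = []
    for op, args in stages:
        if out and op == "ADD" and "P" in args and out[-1][0] == "MUL":
            # {T1 * T2}, {P + T0}  ->  {T1 * T2 + T0}
            out[-1] = ("MAD", out[-1][1] + [a for a in args if a != "P"])
        else:
            out.append((op, args))
    return out
```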

I think the right way to actually implement this in the driver is to
convert the texture env (be it ARB_texture_env_combine /
ATI_texture_env_combine3 or NV_texture_env_combine4) into an
ATI_fragment_program and optimize that. Doing it that way
effectively kills two birds with one stone. We can get away with it
here because the texture env will only ever require one pass. One
nice thing about doing it that way is that you can write an
application that converts texture env scripts to
ATI_fragment_programs, and then compare the direct implementation of
the texture env with the generated ATI_fragment_program. That should
be a *lot* easier to debug than doing it in the driver code!

It's also worth noting that a similar technique can be applied in
the i830 driver to implement ATI_texture_env_combine3. The i830
implements *most* of the required instructions. The unavailable
instructions can be implemented by simpler operations (e.g.,
{T0*T1-T2} becomes {T0*T1} {P-T2}). Adding the optimization pass,
especially if it *did* the optimizations that I said were "probably
not worth the effort", would reduce the chances of needing a
fallback. An env like {T0*T1-T2} {P+T3} {P*C} {P+T0} would be
optimized to {T0*T1+T3} {P-T2} {P*C+T0}.
Looks very nice, but quite complicated :-(.

If you don't think you want to tackle this now, I'll gather up my python script and all my notes on the subject and file an enhancement bug. That way none of the information will get lost / forgotten.
Yes, that would be nice.

Roland


--
_______________________________________________
Dri-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dri-devel
