Ian Romanick wrote:
I'm actually wondering how ATI solved that problem in their driver;
I couldn't see an easy way to avoid the fallback - even using the
2 additional tex env stages or the second phase of the fragment
pipeline isn't going to fix the issue, I think. Maybe someone else
has a good idea?


So, for any set of texture environments there is an ordering of
operations and an assignment of registers that will work. Once upon
a time I wrote a python script that implemented a simple algorithm
to do this.  I'll have to see if I can dig it up.

The algorithm works in two passes. The first pass identifies any
texture stages and texture reads, if any, that do not contribute to
the final result. I'm going to use the notation T# for a texture
read and P for the previous result. If the texture environment is
{ {T0 + T1} {T3 - T2} }, then T0, T1, and the result of adding them
don't contribute to the final result. You can omit those stages
entirely and freely use those registers as temporaries.
Yes, that's what my code does too: it uses the regs which contain
unneeded textures as temporaries (it does not, however, eliminate
the texture lookups or the env stages; I didn't want to mess with
that state for now, but it could be done).

The second pass assigns registers. Each T# gets assigned the next
R#, in order. If T0, T1, and T4 contribute to the final result, they
get assigned R0, R1, and R2. Next, each P gets assigned an available
register. A register is available if it's either unassigned or its
value will not be read again. At any point there is *always* an
available register. I think this is mathematically provable, but
it's way beyond my patience to do so. :)
Ah, you also reorder the texture assignments. I didn't look at that
one; it seemed like too much work (and it shouldn't make a
difference).

Here are a couple of examples. I have left out the operations for
clarity. I'm also going to simplify a bit: I assume 3 textures,
3 registers, 3 stages, and 2 reads per stage.

Start:    {T0, T2}, {P , T0}, {T1, P }
Pass 1:   {T0, T2}, {P , T0}, {T1, P }
Pass 2.1: {R0, R2}, {P , R0}, {R1, P }
Pass 2.2: {R0, R2}, {R2, R0}, {R1, R0}

Start:    {T0, T2}, {T1, T0}, {T1, P }
Pass 1:   {T1, T0}, {T1, P }
Pass 2.1: {R1, R0}, {R1, P }
Pass 2.2: {R1, R0}, {R1, R0}
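Roughly, the two passes could be sketched like this in Python (this is a
reconstruction from the description above, not the actual script; the
stage representation and the fixed 3-register model from the simplified
examples are assumptions):

```python
# Reconstruction of the two-pass scheme described above.  A stage is a
# list of argument names: "T<n>" is a texture read, "P" is the previous
# stage's result.  Assumes the simplified model from the examples:
# 3 registers, fewer reads per stage than registers.

REGS = ("R0", "R1", "R2")

def allocate(stages):
    # Pass 1: drop stages whose result never reaches the final stage.
    # The last stage is always live; an earlier stage is live only if
    # its successor is live *and* actually reads P.
    live = [True] * len(stages)
    for i in range(len(stages) - 2, -1, -1):
        live[i] = live[i + 1] and "P" in stages[i + 1]
    stages = [s for i, s in enumerate(stages) if live[i]]

    # Pass 2.1: surviving textures get registers in index order,
    # e.g. live T0, T1, T4 become R0, R1, R2.
    texs = sorted({a for s in stages for a in s if a != "P"},
                  key=lambda t: int(t[1:]))
    name = {t: REGS[i] for i, t in enumerate(texs)}
    stages = [[name.get(a, a) for a in s] for s in stages]

    # Pass 2.2: each P takes a register that is either unassigned or
    # whose value is never read again, i.e. a register not appearing
    # in the remaining (unresolved) stages.
    for i in range(1, len(stages)):
        if "P" in stages[i]:
            still_read = {a for s in stages[i:] for a in s}
            dest = next(r for r in REGS if r not in still_read)
            stages[i] = [dest if a == "P" else a for a in stages[i]]
    return stages
```

Running it on the two examples above reproduces the Pass 2.2 results
shown there.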

Working through this, I noticed something that I hadn't noticed
before: this technique only works if each operation cannot access
the entire register set. I first did it with 3 reads per stage, and
I very quickly came up with some impossible examples. :) 3 reads
with 6 registers will still work.
But you can have 6 reads per stage (thanks to different alpha/rgb
sources), i.e. the whole register set.
And here's a counterexample to your theory, even assuming only 3
reads per stage :-):
{T4, T5, P} {T2, T3, P} {T0, T1, P} {T4, T5, P} {T2, T3, P} {T0, T1, P}
How do you want to optimize that? In the first two stages you can't
assign any reg, as all 6 texture sampling results are needed again
(that is, unless you analyze the whole "fragment program" and make a
shorter, mathematically equivalent one - but with all the different
operations possible plus scaling etc. this may not be possible).

The nice thing about this algorithm is that it not only works, but
it eliminates "dead code" and unused textures. I don't know about
the former, but the latter can certainly improve the performance of
ill-written code. In addition, this same algorithm could be used to
optimize ATI_fragment_program code. It should also make it possible
to implement NV_texture_env_combine4, which is used by a lot more
programs than ATI_texture_env_combine3. In both these cases you need
to expand the notation to have multiple P values.
I thought about those unused textures too; is it worth bothering to
do performance optimizations for crappy apps? Is such code even in
widespread use?

Other optimizations are possible, but I never explored them. Most of
the ones that I could think of are probably unlikely in practice.
Doing things like replacing {T1 + T2}, {P + P}, {P + T3} with
{T1 + T2}*2, {P + T3}, or replacing {T1 * T2}, {P + T0} with
{T1 * T2 + T0}, are possible, but probably not worth the effort.
That gets close to the complexity of optimizing compilers, which is
not my strength :-). But you're probably right: the env stages are
likely executed faster than the texture lookups (though I have no
idea exactly how fast, something like 1 clock per stage?). In
contrast to optimizing away unused textures, though, there should be
more opportunity for such optimizations.
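For what it's worth, the {T1 * T2}, {P + T0} -> {T1 * T2 + T0} merge
could be sketched as a simple peephole pass (purely hypothetical: the
op names and the `fold_mad` helper are my own, and a real pass would
also have to check that scale and clamp state on the MUL stage permit
the fusion):

```python
# Hypothetical peephole pass: fuse a MUL stage whose result is only
# consumed (via P) by the following ADD stage into one multiply-add.
# In a texture env chain, P only ever feeds the next stage, so the
# MUL result cannot be read anywhere else.

def fold_mad(stages):
    """stages: list of (op, args) pairs, e.g. ("MUL", ["T1", "T2"])."""
    out = []
    for op, args in stages:
        if out and op == "ADD" and "P" in args and out[-1][0] == "MUL":
            # {T1 * T2}, {P + T0}  ->  {T1 * T2 + T0}
            out[-1] = ("MAD", out[-1][1] + [a for a in args if a != "P"])
        else:
            out.append((op, args))
    return out
```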

I think the right way to actually implement this in the driver is to
convert the texture env (be it ARB_texture_env_combine /
ATI_texture_env_combine3 or NV_texture_env_combine4) into an
ATI_fragment_program and optimize that. Doing it that way
effectively kills two birds with one stone. We can get away with it
here because the texture env will only ever require one pass. One
nice thing about doing it that way is that you can write an
application that converts texture env scripts to
ATI_fragment_programs, and then compare the direct implementation of
the texture env with the generated ATI_fragment_program. That should
be a *lot* easier to debug than doing it in the driver code!

It's also worth noting that a similar technique can be applied in
the i830 driver to implement ATI_texture_env_combine3. The i830
implements *most* of the required instructions. The unavailable
instructions can be implemented by simpler operations (e.g.,
{T0*T1-T2} becomes {T0*T1} {P-T2}). Adding the optimization pass,
especially if it *did* the optimizations that I said were "probably
not worth the effort", would reduce the chances of needing a
fallback. An env like {T0*T1-T2} {P+T3} {P*C} {P+T0} would be
optimized to {T0*T1+T3} {P-T2} {P*C+T0}.
Looks very nice, but quite complicated :-(.

If you don't think you want to tackle this now, I'll gather up my python script and all my notes on the subject and file an enhancement bug. That way none of the information will get lost / forgotten.
Yes, that would be nice.

Roland


--
_______________________________________________
Dri-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dri-devel
