On Mon, Feb 16, 2015 at 12:02 PM, Francisco Jerez <curroje...@riseup.net> wrote:
> Jason Ekstrand <ja...@jlekstrand.net> writes: > > > On Mon, Feb 16, 2015 at 10:40 AM, Francisco Jerez <curroje...@riseup.net > > > > wrote: > > > >> Jason Ekstrand <ja...@jlekstrand.net> writes: > >> > >> > On Feb 16, 2015 9:34 AM, "Francisco Jerez" <curroje...@riseup.net> > >> wrote: > >> >> > >> >> Jason Ekstrand <ja...@jlekstrand.net> writes: > >> >> > >> >> > On Feb 16, 2015 8:35 AM, "Francisco Jerez" <curroje...@riseup.net> > >> > wrote: > >> >> >> > >> >> >> The round-robin allocation strategy is expected to decrease the > >> amount > >> >> >> of false dependencies created by the register allocator and give > the > >> >> >> post-RA scheduling pass more freedom to move instructions > around. On > >> >> >> the other hand it has the disadvantage of increasing fragmentation > >> and > >> >> >> decreasing the number of equally-colored nearby nodes, what > increases > >> >> >> the likelihood of failure in presence of optimistically colorable > >> >> >> nodes. > >> >> >> > >> >> >> This patch disables the round-robin strategy for optimistically > >> >> >> colorable nodes. These typically arise in situations of high > >> register > >> >> >> pressure or for registers with large live intervals, in both cases > >> the > >> >> >> task of the instruction scheduler shouldn't be constrained > >> excessively > >> >> >> by the dense packing of those nodes, and a spill (or on Intel > >> hardware > >> >> >> a fall-back to SIMD8 mode) is invariably worse than a slightly > less > >> >> >> optimal scheduling. > >> >> > > >> >> > Actually, that's not true. Matt was doing some experiments > recently > >> > with a > >> >> > noise shader from synmark and the difference between our 2nd and > 3rd > >> > choice > >> >> > schedulers is huge. In that test he disabled the third choice > >> scheduler > >> >> > and the result was a shader that spilled 6 or 8 times but ran > >> something > >> >> > like 30% faster. We really need to do some more experimentation > with > >> >> > scheduling and figure out better heuristics than "SIMD16 is always > >> > faster" > >> >> > and "spilling is bad". > >> >> > > >> >> > >> >> Yes, I'm aware of rare corner cases like that where e.g. SIMD16 > leads to > >> >> higher cache thrashing than SIMD8 leading to decreased overall > >> >> performance, and a case where a shader SIMD16 *with* spills has > better > >> >> performance than the SIMD8 version of the same shader without spills. > >> >> > >> >> In any case it's not the register allocator's business to implement > such > >> >> heuristics, and that's not an argument against the register allocator > >> >> trying to make a more efficient use of the register file. > >> > > >> > The primary point I was trying to make is that scheduling *does* > matter. > >> > It matters a lot. In fact, Matt and i have talked about throwing away > >> the > >> > SIMD16 program if it ends up using the pessimal schedulong algorithm. > >> > Throwing scheduling to the wind just to gain a few SIMD16 programs is > >> > probably not a good trade-off. > >> > > >> In my experience the exact opposite observation has been far more > >> common. Running SIMD16 vs SIMD8 has a larger impact on performance than > >> the way you end up scheduling things post-regalloc. Actually even if > >> you end up causing some unmet instruction dependencies by the way > >> instructions are scheduled post-regalloc, the EU can context-switch to > >> service the next available thread almost for free when a thread stalls > >> on some dependency. Also the fact that you're doing SIMD16 itself makes > >> post-regalloc scheduling less important because it naturally has an > >> effect in hiding latency. > >> > >> My intuition is that the huge performance improvement Matt observed by > >> disabling the third scheduling heuristic is more likely to have been > >> caused by a decrease in the amount of cache thrashing caused by the fact > >> that he was running less channels concurrently rather than by the > >> scheduling heuristic itself. Matt, did you rule out that possibility? > >> > >> The other thing is this patch has an effect on the allocation strategy > >> for optimistically colorable nodes *only*. We're already heavily > >> constrained by register pressure when we get to that point, and assuming > >> allocation succeeds the post-regalloc scheduler is going to have little > >> room for maneuvering anyway. > >> > >> > It could be that this is an good idea, but it's going to take more > than > >> > hand-waved theories about register allocation one shader not spilling > to > >> > convince me. Do you actually know what it did to scheduling? It > >> wouldn't > >> > be hard to hack up the driver and shader-db to collect that > information. > >> > > >> 44 shaders going SIMD16 seems like a strong enough argument to me. > >> Could you be more precise about what additional information you want me > >> to collect? > >> > > > > How many shaders go from the first scheduling method to the second or to > > the third. In other words some sort of metric on which shaders are > > "helped" or "hurt" in their scheduling. > > OK, I hacked the driver to output the scheduling heuristic that had been > used when we allocated registers successfully for the program via > KHR_debug and then ran shader-db before and after applying this patch. > > Before this patch: > Heuristic SIMD8 SIMD16 > PRE 18924 18598 > PRE_NON_LIFO 72 38 > PRE_LIFO 8 3 > > After this patch: > Heuristic SIMD8 SIMD16 > PRE 18939 18643 > PRE_NON_LIFO 57 37 > PRE_LIFO 8 3 > > So it actually *decreases* the number of shaders falling back to the > latency-insensitive heuristics because the register allocator is more > likely to succeed with the PRE heuristic. > Yeah, I guess it only affects post-allocation scheduling and accidental dependencies (caused by two instructions happening to use the same register). Sorry for not thinking about that earlier. In that case, I have no problem with the patch but I can't say I've reviewed it either. I'll try and take a look at it tomorrow and do a real review. --Jason > > > > > >> > --Jason > >> > > >> >> >> Shader-db results on the i965 driver: > >> >> >> > >> >> >> total instructions in shared programs: 5488539 -> 5488489 (-0.00%) > >> >> >> instructions in affected programs: 1121 -> 1071 (-4.46%) > >> >> >> helped: 1 > >> >> >> HURT: 0 > >> >> >> GAINED: 49 > >> >> >> LOST: 5 > >> >> >> --- > >> >> >> src/util/register_allocate.c | 22 +++++++++++++++++++++- > >> >> >> 1 file changed, 21 insertions(+), 1 deletion(-) > >> >> >> > >> >> >> diff --git a/src/util/register_allocate.c > >> > b/src/util/register_allocate.c > >> >> >> index af7a20c..d63d8eb 100644 > >> >> >> --- a/src/util/register_allocate.c > >> >> >> +++ b/src/util/register_allocate.c > >> >> >> @@ -168,6 +168,12 @@ struct ra_graph { > >> >> >> > >> >> >> unsigned int *stack; > >> >> >> unsigned int stack_count; > >> >> >> + > >> >> >> + /** > >> >> >> + * Tracks the start of the set of optimistically-colored > >> registers > >> > in > >> >> > the > >> >> >> + * stack. > >> >> >> + */ > >> >> >> + unsigned int stack_optimistic_start; > >> >> >> }; > >> >> >> > >> >> >> /** > >> >> >> @@ -454,6 +460,7 @@ static void > >> >> >> ra_simplify(struct ra_graph *g) > >> >> >> { > >> >> >> bool progress = true; > >> >> >> + unsigned int stack_optimistic_start = ~0; > >> >> >> int i; > >> >> >> > >> >> >> while (progress) { > >> >> >> @@ -483,12 +490,16 @@ ra_simplify(struct ra_graph *g) > >> >> >> > >> >> >> if (!progress && best_optimistic_node != ~0U) { > >> >> >> decrement_q(g, best_optimistic_node); > >> >> >> + stack_optimistic_start = > >> >> >> + MIN2(stack_optimistic_start, g->stack_count); > >> >> >> g->stack[g->stack_count] = best_optimistic_node; > >> >> >> g->stack_count++; > >> >> >> g->nodes[best_optimistic_node].in_stack = true; > >> >> >> progress = true; > >> >> >> } > >> >> >> } > >> >> >> + > >> >> >> + g->stack_optimistic_start = stack_optimistic_start; > >> >> >> } > >> >> >> > >> >> >> /** > >> >> >> @@ -542,7 +553,16 @@ ra_select(struct ra_graph *g) > >> >> >> g->nodes[n].reg = r; > >> >> >> g->stack_count--; > >> >> >> > >> >> >> - if (g->regs->round_robin) > >> >> >> + /* Rotate the starting point except for optimistically > >> colorable > >> >> > nodes. > >> >> >> + * The likelihood that we will succeed at allocating > >> > optimistically > >> >> >> + * colorable nodes is highly dependent on the way that the > >> > previous > >> >> >> + * nodes popped off the stack are laid out. The > round-robin > >> >> > strategy > >> >> >> + * increases the fragmentation of the register file and > >> > decreases > >> >> > the > >> >> >> + * number of nearby nodes assigned to the same color, what > >> >> > increases the > >> >> >> + * likelihood of spilling with respect to the dense packing > >> >> > strategy. > >> >> >> + */ > >> >> >> + if (g->regs->round_robin && > >> >> >> + g->stack_count <= g->stack_optimistic_start) > >> >> >> start_search_reg = r + 1; > >> >> >> } > >> >> >> > >> >> >> -- > >> >> >> 2.1.3 > >> >> >> > >> >> >> _______________________________________________ > >> >> >> mesa-dev mailing list > >> >> >> mesa-dev@lists.freedesktop.org > >> >> >> http://lists.freedesktop.org/mailman/listinfo/mesa-dev > >> >
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev