Eero Tamminen <eero.t.tammi...@intel.com> writes: > Hi, > > I tested this on HSW GT2, BXT & SKL GT3e, and didn't see any significant > regressions this time. I'll try it also on a machine with smaller > variance than those (now that it became free), and send a note if that > does show something. >
You're not expected to be able to see any performance changes with this revision except in test-cases bound to the shader compilation performance (e.g. SynMark OglDrvShComp), and only if they do complex enough CFGs for the quadratic factor described below to be significant. > - Eero > > On 20.12.2017 21:27, Francisco Jerez wrote: >> Previously the dataflow propagation algorithm would calculate the ACP >> live-in and -out sets in a two-pass fixed-point algorithm. The first >> pass would update the live-out sets of all basic blocks of the program >> based on their live-in sets, while the second pass would update the >> live-in sets based on the live-out sets. This is incredibly >> inefficient in the typical case where the CFG of the program is >> approximately acyclic, because it can take up to 2*n passes for an ACP >> entry introduced at the top of the program to reach the bottom (where >> n is the number of basic blocks in the program), until which point the >> algorithm won't be able to reach a fixed point. >> >> The same effect can be achieved in a single pass by computing the >> live-in and -out sets in lock-step, because that makes sure that >> processing of any basic block will pick up the updated live-out sets >> of the lexically preceding blocks. This gives the dataflow >> propagation algorithm effectively O(n) run-time instead of O(n^2) in >> the acyclic case. >> >> The time spent in dataflow propagation is reduced by 30x in the >> GLES31.functional.ssbo.layout.random.all_shared_buffer.5 dEQP >> test-case on my CHV system (the improvement is likely to be of the >> same order of magnitude on other platforms). This more than reverses >> an apparent run-time regression in this test-case from my previous >> copy-propagation undefined-value handling patch, which was ultimately >> caused by the additional work introduced in that commit to account for >> undefined values being multiplied by a huge quadratic factor. >> >> According to Chad this test was failing on CHV due to a 30s time-out >> imposed by the Android CTS (this was the case regardless of my >> undefined-value handling patch, even though my patch substantially >> exacerbated the issue). On my CHV system this patch reduces the >> overall run-time of the test by approximately 12x, getting us to >> around 13s, well below the time-out. >> >> v2: Initialize live-out set to the universal set to avoid rather >> pessimistic dataflow estimation in shaders with cycles (Addresses >> performance regression reported by Eero in GpuTest Piano). >> Performance numbers given above still apply. No shader-db changes >> with respect to master. >> >> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104271 >> Reported-by: Chad Versace <chadvers...@chromium.org> >> --- >> src/intel/compiler/brw_fs_copy_propagation.cpp | 35 >> ++++++++------------------ >> 1 file changed, 11 insertions(+), 24 deletions(-) >> >> diff --git a/src/intel/compiler/brw_fs_copy_propagation.cpp >> b/src/intel/compiler/brw_fs_copy_propagation.cpp >> index af5635eacef..92cc0a8de58 100644 >> --- a/src/intel/compiler/brw_fs_copy_propagation.cpp >> +++ b/src/intel/compiler/brw_fs_copy_propagation.cpp >> @@ -186,8 +186,7 @@ fs_copy_prop_dataflow::setup_initial_values() >> >> /* Populate the initial values for the livein and liveout sets. For the >> * block at the start of the program, livein = 0 and liveout = copy. >> - * For the others, set liveout to 0 (the empty set) and livein to ~0 >> - * (the universal set). >> + * For the others, set liveout and livein to ~0 (the universal set). >> */ >> foreach_block (block, cfg) { >> if (block->parents.is_empty()) { >> @@ -197,7 +196,7 @@ fs_copy_prop_dataflow::setup_initial_values() >> } >> } else { >> for (int i = 0; i < bitset_words; i++) { >> - bd[block->num].liveout[i] = 0u; >> + bd[block->num].liveout[i] = ~0u; >> bd[block->num].livein[i] = ~0u; >> } >> } >> @@ -228,34 +227,17 @@ fs_copy_prop_dataflow::run() >> do { >> progress = false; >> >> - /* Update liveout for all blocks. */ >> foreach_block (block, cfg) { >> if (block->parents.is_empty()) >> continue; >> >> for (int i = 0; i < bitset_words; i++) { >> const BITSET_WORD old_liveout = bd[block->num].liveout[i]; >> - >> - bd[block->num].liveout[i] = >> - bd[block->num].copy[i] | (bd[block->num].livein[i] & >> - ~bd[block->num].kill[i]); >> - >> - if (old_liveout != bd[block->num].liveout[i]) >> - progress = true; >> - } >> - } >> - >> - /* Update livein for all blocks. If a copy is live out of all parent >> - * blocks, it's live coming in to this block. >> - */ >> - foreach_block (block, cfg) { >> - if (block->parents.is_empty()) >> - continue; >> - >> - for (int i = 0; i < bitset_words; i++) { >> - const BITSET_WORD old_livein = bd[block->num].livein[i]; >> BITSET_WORD livein_from_any_block = 0; >> >> + /* Update livein for this block. If a copy is live out of all >> + * parent blocks, it's live coming in to this block. >> + */ >> bd[block->num].livein[i] = ~0u; >> foreach_list_typed(bblock_link, parent_link, link, >> &block->parents) { >> bblock_t *parent = parent_link->block; >> @@ -278,7 +260,12 @@ fs_copy_prop_dataflow::run() >> */ >> bd[block->num].livein[i] &= livein_from_any_block; >> >> - if (old_livein != bd[block->num].livein[i]) >> + /* Update liveout for this block. */ >> + bd[block->num].liveout[i] = >> + bd[block->num].copy[i] | (bd[block->num].livein[i] & >> + ~bd[block->num].kill[i]); >> + >> + if (old_liveout != bd[block->num].liveout[i]) >> progress = true; >> } >> } >>
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev