fs: Add reuse_16bit_conversions_register optimization

Alejandro Piñeiro Mon, 28 Aug 2017 03:25:29 -0700

On 27/08/17 20:24, Connor Abbott wrote:
> Hi,
>
> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <apinhe...@igalia.com
> <mailto:apinhe...@igalia.com>> wrote:
>
>     On 24/08/17 21:07, Connor Abbott wrote:
>     >
>     > Hi Alejandro,
>
>     Hi Connor,
>
>     >
>     > This seems really suspicious. If the live ranges are really
>     > independent, then the register allocator should be able to
>     assign the
>     > two virtual registers to the same physical register if it needs to.
>
>     Yes, it is true, the register allocator should be able to assign two
>     virtual registers to the same physical register. But that is done
>     at the
>     end (or really near the end), so late for the problem this
>     optimization
>     is trying to fix.
>
>
> Well, my understanding is that the problem is long compilation times
> due to spilling and our not-so-great implementation of it. So no,
> register allocation is not late for the problem. As both Curro and I
> explained, the change by itself can only pessimise register
> allocation, so if it helps then it must be due to a bug in the
> register allocator or a problem in a subsequent pass that's getting
> hidden by this one.


Ok.

>
>     We are also reducing the amount of instructions used.
>
>
> The comments in the source code say otherwise. Any instructions
> eliminated were from spilling, which this pass only accidentally reduces.

Yes, sorry, I explained myself poorly. The optimization itself doesn't
remove any instructions. But using it reduces the final number of
instructions, although as you say, they are likely due reducing the
spilling.

>
>
>
>     Probably not really clear on the commit message. When I say
>     "reduce the
>     pressure of the register allocator" I mean having a code that the
>     register allocator would be able to handle without using too much
>     time.
>     The problem this optimization tries to solve is that for some 16
>     bit CTS
>     tests (some with matrices and geometry shaders), the amount of virtual
>     registers used and instructions was really big. For the record,
>     initially, some tests needed 24 min just to compile. Right now, thanks
>     to other optimizations, the slower test without this optimization
>     needs
>     1min 30 seconds. Adding some hacky timestamps, the time used  at
>     fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>
>     * While trying to schedule using the three available pre mode
>     heuristics: 7 seconds
>     * Allocation with spilling: 63 seconds
>     * Final schedule using SCHEDULE_POST: 19 seconds
>
>     With this optimization, the total time goes down to 14 seconds (10
>     + 0 +
>     3 on the previous bullet point list).
>
>     One could argue that 1min 30 seconds is okish. But taking into account
>     that it goes down to 14 seconds, even with some caveats (see below), I
>     still think that it is worth to use the optimization.
>
>     And a final comment. For that same test, this is the final stats
>     (using
>     INTEL_DEBUG):
>
>      * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
>     130320 cycles. 15:9 spills:fills.
>      * Without the optimization: SIMD8 shader: 12312 instructions. 0
>     loops.
>     174816 cycles. 751:1851 spills:fills.
>
>
> So, the fact that it helps at all with SIMD8 shows that my theory is
> wrong, but since your pass reduces spilling, it clearly must be
> avoiding a bug somewhere else. You need to compare the IR for a shader
> with the problem with and without this pass right before register
> allocation. Maybe the sources and destinations of the conversion
> instructions interfere without the change due to some other pass
> that's increasing register pressure, in which case that's the problem,
> but I doubt it.

Ok, thanks for the hints.

> (IIRC there's a debug option to print out the register pressure for
> each instruction, which will be helpful here).

I tried to use that option back when I was working on this patch,
without too much luck. Will try again.

> If that's not the case, you should check to see if the register
> allocator thinks the sources and destinations of the conversion
> instructions interfere, and if so figure out why - that will likely be
> the real bug.

Ok.

Thanks Connor and Curro for the comments. I will work on a different
solution.

>
>
>     > This change forces the two to be the same, which constrains the
>     > register allocator unecessarily and should make it worse, so I'm
>     > confused as to why this would help at all.
>
>     I didn't check that issue specifically, but I recently found that this
>     optimization affects copy propagation/dead code eliminate. So
>     there are
>     still some room for improvement. But in any case, take into
>     account that
>     this custom optimization is only used if there is a 32 to 16 bit
>     conversion, so only affects shaders with this specific feature.
>
>     >
>     > IIRC there were some issues where we unnecessarily made the sources
>     > and destination of an instruction interefere with each other, but if
>     > that's what's causing this, then we should fix that underlying
>     issue.
>     >
>     > (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the
>     > generator, in which case the second half of the source is read after
>     > the first half of the destination is written, and we falsely thought
>     > that the HW did that too, so we had some code to add a fake
>     > interference between them, but a while ago Curro moved the expansion
>     > to happen before register allocation. I don't have the code in front
>     > of me, but I think we still have this useless code lying around,
>     and I
>     > would guess this is the source of the problem.)
>
>     Taking into account what I explained before, I don't think that the
>     problem is the interference or this code you mention (although perhaps
>     Im wrong).
>
>     > hhhh
>     > Connor
>     >
>     > On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro"
>     <apinhe...@igalia.com <mailto:apinhe...@igalia.com>
>     > <mailto:apinhe...@igalia.com <mailto:apinhe...@igalia.com>>> wrote:
>     >
>     >     When dealing with HF/U/UW, it is usual having a register with a
>     >     F/D/UD, and then convert it to HF/U/UW, and not use again
>     the F/D/UD
>     >     value. In those cases it would be possible to reuse the
>     register where
>     >     the F value is initially stored instead of having two. Take
>     also into
>     >     account that when operating with HF/U/UW, you would need to
>     use the
>     >     full register (so stride 2). Packs/unpacks would be only
>     useful when
>     >     loading/storing several HF/W/UW.
>     >
>     >     Note that no instruction is removed. The main benefict is
>     reducing the
>     >     amoung of registers used, so the pressure on the register
>     allocator is
>     >     decreased with big shaders.
>     >
>     >     Possibly this could be integrated into an existing
>     optimization, at it
>     >     is even already done by the register allocator, but was far
>     easier to
>     >     write and cleaner to read as a separate optimization.
>     >
>     >     We found this issue when dealing with some Vulkan CTS tests that
>     >     needed several minutes to compile. Most of the time was
>     spent on the
>     >     register allocator.
>     >
>     >     Right now the optimization only handles 32 to 16 bit
>     conversion. It
>     >     could be possible to do the equivalent for 16 to 32 bit too,
>     but in
>     >     practice, we didn't need it.
>     >     ---
>     >      src/intel/compiler/brw_fs.cpp | 77
>     >     +++++++++++++++++++++++++++++++++++++++++++
>     >      src/intel/compiler/brw_fs.h   |  1 +
>     >      2 files changed, 78 insertions(+)
>     >
>     >     diff --git a/src/intel/compiler/brw_fs.cpp
>     >     b/src/intel/compiler/brw_fs.cpp
>     >     index b6013a5ce85..1342150b44e 100644
>     >     --- a/src/intel/compiler/brw_fs.cpp
>     >     +++ b/src/intel/compiler/brw_fs.cpp
>     >     @@ -39,6 +39,7 @@
>     >      #include "compiler/glsl_types.h"
>     >      #include "compiler/nir/nir_builder.h"
>     >      #include "program/prog_parameter.h"
>     >     +#include "brw_fs_live_variables.h"
>     >
>     >      using namespace brw;
>     >
>     >     @@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_rounding_modes()
>     >         return progress;
>     >      }
>     >
>     >     +/**
>     >     + * When dealing with HF/W/UW, it is usual having a register
>     with
>     >     a F/D/UD, and
>     >     + * then convert it to HF/W/UW, and not use again the F/D/UD
>     >     value. In those
>     >     + * cases it would be possible to reuse the register where the F
>     >     value is
>     >     + * initially stored instead of having two. Take also into
>     account
>     >     that when
>     >     + * operating with HF/W/UW, you would need to use the full
>     >     register (so stride
>     >     + * 2). Packs/unpacks would be only useful when loading/storing
>     >     several
>     >     + * HF/W/UWs.
>     >     + *
>     >     + * So something like this:
>     >     + *  mov(8) vgrf14<2>:HF, vgrf39:F
>     >     + *
>     >     + * Became:
>     >     + *  mov(8) vgrf39<2>:HF, vgrf39:F
>     >     + *
>     >     + * Note that no instruction is removed. The main benefict is
>     >     reducing the
>     >     + * amoung of registers used, so the pressure on the register
>     >     allocator is
>     >     + * decreased with big shaders.
>     >     + */
>     >     +bool
>     >     +fs_visitor::reuse_16bit_conversions_vgrf()
>     >     +{
>     >     +   bool progress = false;
>     >     +   int ip = -1;
>     >     +
>     >     +   calculate_live_intervals();
>     >     +
>     >     +   foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
>     >     +      ip++;
>     >     +
>     >     +      if (inst->dst.file != VGRF || inst->src[0].file != VGRF)
>     >     +         continue;
>     >     +
>     >     +      if (inst->opcode != BRW_OPCODE_MOV)
>     >     +         continue;
>     >     +
>     >     +      if (type_sz(inst->dst.type) != 2 || inst->dst.stride
>     != 2 ||
>     >     +          type_sz(inst->src[0].type) != 4 ||
>     inst->src[0].stride
>     >     != 1) {
>     >     +         continue;
>     >     +      }
>     >     +
>     >     +      int src_reg = inst->src[0].nr;
>     >     +      int src_offset = inst->src[0].offset;
>     >     +      unsigned src_var =
>     live_intervals->var_from_vgrf[src_reg];
>     >     +      int src_end = live_intervals->end[src_var];
>     >     +      int dst_reg = inst->dst.nr <http://dst.nr>
>     <http://dst.nr>;
>     >     +
>     >     +      if (src_end > ip)
>     >     +         continue;
>     >     +
>     >     +      foreach_block_and_inst(block, fs_inst, scan_inst, cfg) {
>     >     +         if (scan_inst->dst.file == VGRF &&
>     >     +             scan_inst->dst.nr <http://dst.nr>
>     <http://dst.nr> == dst_reg) {
>     >     +            scan_inst->dst.nr <http://dst.nr>
>     <http://dst.nr> = src_reg;
>     >     +            scan_inst->dst.offset = src_offset;
>     >     +            progress = true;
>     >     +         }
>     >     +
>     >     +         for (int i = 0; i < scan_inst->sources; i++) {
>     >     +            if (scan_inst->src[i].file == VGRF &&
>     >     +                scan_inst->src[i].nr == dst_reg) {
>     >     +               scan_inst->src[i].nr = src_reg;
>     >     +               scan_inst->src[i].offset = src_offset;
>     >     +               progress = true;
>     >     +            }
>     >     +         }
>     >     +      }
>     >     +   }
>     >     +
>     >     +   if (progress)
>     >     +      invalidate_live_intervals();
>     >     +
>     >     +   return progress;
>     >     +}
>     >     +
>     >
>     >      static void
>     >      clear_deps_for_inst_src(fs_inst *inst, bool *deps, int
>     first_grf,
>     >     int grf_len)
>     >     @@ -5829,6 +5905,7 @@ fs_visitor::optimize()
>     >
>     >         OPT(opt_drop_redundant_mov_to_flags);
>     >         OPT(remove_extra_rounding_modes);
>     >     +   OPT(reuse_16bit_conversions_vgrf);
>     >
>     >         do {
>     >            progress = false;
>     >     diff --git a/src/intel/compiler/brw_fs.h
>     b/src/intel/compiler/brw_fs.h
>     >     index b9476e69edb..2685f748b87 100644
>     >     --- a/src/intel/compiler/brw_fs.h
>     >     +++ b/src/intel/compiler/brw_fs.h
>     >     @@ -151,6 +151,7 @@ public:
>     >         bool dead_code_eliminate();
>     >         bool remove_duplicate_mrf_writes();
>     >         bool remove_extra_rounding_modes();
>     >     +   bool reuse_16bit_conversions_vgrf();
>     >
>     >         bool opt_sampler_eot();
>     >         bool virtual_grf_interferes(int a, int b);
>     >     --
>     >     2.11.0
>     >
>     >     _______________________________________________
>     >     mesa-dev mailing list
>     >     mesa-dev@lists.freedesktop.org
>     <mailto:mesa-dev@lists.freedesktop.org>
>     <mailto:mesa-dev@lists.freedesktop.org
>     <mailto:mesa-dev@lists.freedesktop.org>>
>     >     https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>     <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
>     >     <https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>     <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>>
>     >
>     >
>     >
>
>
>


_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

Reply via email to