On 27/08/17 20:24, Connor Abbott wrote: > Hi, > > On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <apinhe...@igalia.com > <mailto:apinhe...@igalia.com>> wrote: > > On 24/08/17 21:07, Connor Abbott wrote: > > > > Hi Alejandro, > > Hi Connor, > > > > > This seems really suspicious. If the live ranges are really > > independent, then the register allocator should be able to > assign the > > two virtual registers to the same physical register if it needs to. > > Yes, it is true, the register allocator should be able to assign two > virtual registers to the same physical register. But that is done > at the > end (or really near the end), so late for the problem this > optimization > is trying to fix. > > > Well, my understanding is that the problem is long compilation times > due to spilling and our not-so-great implementation of it. So no, > register allocation is not late for the problem. As both Curro and I > explained, the change by itself can only pessimise register > allocation, so if it helps then it must be due to a bug in the > register allocator or a problem in a subsequent pass that's getting > hidden by this one.
Ok. > > We are also reducing the amount of instructions used. > > > The comments in the source code say otherwise. Any instructions > eliminated were from spilling, which this pass only accidentally reduces. Yes, sorry, I explained myself poorly. The optimization itself doesn't remove any instructions. But using it reduces the final number of instructions, although as you say, they are likely due reducing the spilling. > > > > Probably not really clear on the commit message. When I say > "reduce the > pressure of the register allocator" I mean having a code that the > register allocator would be able to handle without using too much > time. > The problem this optimization tries to solve is that for some 16 > bit CTS > tests (some with matrices and geometry shaders), the amount of virtual > registers used and instructions was really big. For the record, > initially, some tests needed 24 min just to compile. Right now, thanks > to other optimizations, the slower test without this optimization > needs > 1min 30 seconds. Adding some hacky timestamps, the time used at > fs_visitor::allocate_registers (brw_fs.cpp:6096) is: > > * While trying to schedule using the three available pre mode > heuristics: 7 seconds > * Allocation with spilling: 63 seconds > * Final schedule using SCHEDULE_POST: 19 seconds > > With this optimization, the total time goes down to 14 seconds (10 > + 0 + > 3 on the previous bullet point list). > > One could argue that 1min 30 seconds is okish. But taking into account > that it goes down to 14 seconds, even with some caveats (see below), I > still think that it is worth to use the optimization. > > And a final comment. For that same test, this is the final stats > (using > INTEL_DEBUG): > > * With the optimization: SIMD8 shader: 4610 instructions. 0 loops. > 130320 cycles. 15:9 spills:fills. > * Without the optimization: SIMD8 shader: 12312 instructions. 0 > loops. > 174816 cycles. 751:1851 spills:fills. > > > So, the fact that it helps at all with SIMD8 shows that my theory is > wrong, but since your pass reduces spilling, it clearly must be > avoiding a bug somewhere else. You need to compare the IR for a shader > with the problem with and without this pass right before register > allocation. Maybe the sources and destinations of the conversion > instructions interfere without the change due to some other pass > that's increasing register pressure, in which case that's the problem, > but I doubt it. Ok, thanks for the hints. > (IIRC there's a debug option to print out the register pressure for > each instruction, which will be helpful here). I tried to use that option back when I was working on this patch, without too much luck. Will try again. > If that's not the case, you should check to see if the register > allocator thinks the sources and destinations of the conversion > instructions interfere, and if so figure out why - that will likely be > the real bug. Ok. Thanks Connor and Curro for the comments. I will work on a different solution. > > > > This change forces the two to be the same, which constrains the > > register allocator unecessarily and should make it worse, so I'm > > confused as to why this would help at all. > > I didn't check that issue specifically, but I recently found that this > optimization affects copy propagation/dead code eliminate. So > there are > still some room for improvement. But in any case, take into > account that > this custom optimization is only used if there is a 32 to 16 bit > conversion, so only affects shaders with this specific feature. > > > > > IIRC there were some issues where we unnecessarily made the sources > > and destination of an instruction interefere with each other, but if > > that's what's causing this, then we should fix that underlying > issue. > > > > (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the > > generator, in which case the second half of the source is read after > > the first half of the destination is written, and we falsely thought > > that the HW did that too, so we had some code to add a fake > > interference between them, but a while ago Curro moved the expansion > > to happen before register allocation. I don't have the code in front > > of me, but I think we still have this useless code lying around, > and I > > would guess this is the source of the problem.) > > Taking into account what I explained before, I don't think that the > problem is the interference or this code you mention (although perhaps > Im wrong). > > > hhhh > > Connor > > > > On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro" > <apinhe...@igalia.com <mailto:apinhe...@igalia.com> > > <mailto:apinhe...@igalia.com <mailto:apinhe...@igalia.com>>> wrote: > > > > When dealing with HF/U/UW, it is usual having a register with a > > F/D/UD, and then convert it to HF/U/UW, and not use again > the F/D/UD > > value. In those cases it would be possible to reuse the > register where > > the F value is initially stored instead of having two. Take > also into > > account that when operating with HF/U/UW, you would need to > use the > > full register (so stride 2). Packs/unpacks would be only > useful when > > loading/storing several HF/W/UW. > > > > Note that no instruction is removed. The main benefict is > reducing the > > amoung of registers used, so the pressure on the register > allocator is > > decreased with big shaders. > > > > Possibly this could be integrated into an existing > optimization, at it > > is even already done by the register allocator, but was far > easier to > > write and cleaner to read as a separate optimization. > > > > We found this issue when dealing with some Vulkan CTS tests that > > needed several minutes to compile. Most of the time was > spent on the > > register allocator. > > > > Right now the optimization only handles 32 to 16 bit > conversion. It > > could be possible to do the equivalent for 16 to 32 bit too, > but in > > practice, we didn't need it. > > --- > > src/intel/compiler/brw_fs.cpp | 77 > > +++++++++++++++++++++++++++++++++++++++++++ > > src/intel/compiler/brw_fs.h | 1 + > > 2 files changed, 78 insertions(+) > > > > diff --git a/src/intel/compiler/brw_fs.cpp > > b/src/intel/compiler/brw_fs.cpp > > index b6013a5ce85..1342150b44e 100644 > > --- a/src/intel/compiler/brw_fs.cpp > > +++ b/src/intel/compiler/brw_fs.cpp > > @@ -39,6 +39,7 @@ > > #include "compiler/glsl_types.h" > > #include "compiler/nir/nir_builder.h" > > #include "program/prog_parameter.h" > > +#include "brw_fs_live_variables.h" > > > > using namespace brw; > > > > @@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_rounding_modes() > > return progress; > > } > > > > +/** > > + * When dealing with HF/W/UW, it is usual having a register > with > > a F/D/UD, and > > + * then convert it to HF/W/UW, and not use again the F/D/UD > > value. In those > > + * cases it would be possible to reuse the register where the F > > value is > > + * initially stored instead of having two. Take also into > account > > that when > > + * operating with HF/W/UW, you would need to use the full > > register (so stride > > + * 2). Packs/unpacks would be only useful when loading/storing > > several > > + * HF/W/UWs. > > + * > > + * So something like this: > > + * mov(8) vgrf14<2>:HF, vgrf39:F > > + * > > + * Became: > > + * mov(8) vgrf39<2>:HF, vgrf39:F > > + * > > + * Note that no instruction is removed. The main benefict is > > reducing the > > + * amoung of registers used, so the pressure on the register > > allocator is > > + * decreased with big shaders. > > + */ > > +bool > > +fs_visitor::reuse_16bit_conversions_vgrf() > > +{ > > + bool progress = false; > > + int ip = -1; > > + > > + calculate_live_intervals(); > > + > > + foreach_block_and_inst_safe (block, fs_inst, inst, cfg) { > > + ip++; > > + > > + if (inst->dst.file != VGRF || inst->src[0].file != VGRF) > > + continue; > > + > > + if (inst->opcode != BRW_OPCODE_MOV) > > + continue; > > + > > + if (type_sz(inst->dst.type) != 2 || inst->dst.stride > != 2 || > > + type_sz(inst->src[0].type) != 4 || > inst->src[0].stride > > != 1) { > > + continue; > > + } > > + > > + int src_reg = inst->src[0].nr; > > + int src_offset = inst->src[0].offset; > > + unsigned src_var = > live_intervals->var_from_vgrf[src_reg]; > > + int src_end = live_intervals->end[src_var]; > > + int dst_reg = inst->dst.nr <http://dst.nr> > <http://dst.nr>; > > + > > + if (src_end > ip) > > + continue; > > + > > + foreach_block_and_inst(block, fs_inst, scan_inst, cfg) { > > + if (scan_inst->dst.file == VGRF && > > + scan_inst->dst.nr <http://dst.nr> > <http://dst.nr> == dst_reg) { > > + scan_inst->dst.nr <http://dst.nr> > <http://dst.nr> = src_reg; > > + scan_inst->dst.offset = src_offset; > > + progress = true; > > + } > > + > > + for (int i = 0; i < scan_inst->sources; i++) { > > + if (scan_inst->src[i].file == VGRF && > > + scan_inst->src[i].nr == dst_reg) { > > + scan_inst->src[i].nr = src_reg; > > + scan_inst->src[i].offset = src_offset; > > + progress = true; > > + } > > + } > > + } > > + } > > + > > + if (progress) > > + invalidate_live_intervals(); > > + > > + return progress; > > +} > > + > > > > static void > > clear_deps_for_inst_src(fs_inst *inst, bool *deps, int > first_grf, > > int grf_len) > > @@ -5829,6 +5905,7 @@ fs_visitor::optimize() > > > > OPT(opt_drop_redundant_mov_to_flags); > > OPT(remove_extra_rounding_modes); > > + OPT(reuse_16bit_conversions_vgrf); > > > > do { > > progress = false; > > diff --git a/src/intel/compiler/brw_fs.h > b/src/intel/compiler/brw_fs.h > > index b9476e69edb..2685f748b87 100644 > > --- a/src/intel/compiler/brw_fs.h > > +++ b/src/intel/compiler/brw_fs.h > > @@ -151,6 +151,7 @@ public: > > bool dead_code_eliminate(); > > bool remove_duplicate_mrf_writes(); > > bool remove_extra_rounding_modes(); > > + bool reuse_16bit_conversions_vgrf(); > > > > bool opt_sampler_eot(); > > bool virtual_grf_interferes(int a, int b); > > -- > > 2.11.0 > > > > _______________________________________________ > > mesa-dev mailing list > > mesa-dev@lists.freedesktop.org > <mailto:mesa-dev@lists.freedesktop.org> > <mailto:mesa-dev@lists.freedesktop.org > <mailto:mesa-dev@lists.freedesktop.org>> > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev > <https://lists.freedesktop.org/mailman/listinfo/mesa-dev> > > <https://lists.freedesktop.org/mailman/listinfo/mesa-dev > <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>> > > > > > > > > > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev