On Sun, Nov 17, 2019 at 3:35 PM Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> (It's 23:35 local time, so it's still just about stage 1. :-))
>
> While working on SVE, I've noticed several cases in which we fail
> to combine instructions because the combined form would need to be
> placed earlier in the instruction stream than the last of the
> instructions being combined.  This includes one very important
> case in the handling of the first fault register (FFR).
>
> Combine currently requires the combined instruction to live at the same
> location as i3.  I thought about trying to relax that restriction, but it
> would be difficult to do with the current pass structure while keeping
> everything linear-ish time.
>
> So this patch instead goes for an option that has been talked about
> several times over the years: writing a new combine pass that just
> does instruction combination, and not all the other optimisations
> that have been bolted onto combine over time.  E.g. it deliberately
> doesn't do things like nonzero-bits tracking, since that really ought
> to be a separate, more global, optimisation.
>
> This is still far from being a realistic replacement for even the
> combine parts of the current combine pass.  E.g.:
>
> - it only handles combinations that can be built up from individual
>   two-instruction combinations.
>
> - it doesn't allow new hard register clobbers to be added.
>
> - it doesn't have the special treatment of CC operations.
>
> - etc.
>
> But we have to start somewhere.
>
> On a more positive note, the pass handles things that the current
> combine pass doesn't:
>
> - the main motivating feature mentioned above: it works out where
>   the combined instruction could validly live and moves it there
>   if necessary.  If there is a range of valid places, it tries
>   to pick the best one based on register pressure (although only
>   with a simple heuristic for now).
>
> - once it has combined two instructions, it can try combining the
>   result with both later and earlier code, i.e. it can combine
>   in both directions.
>
> - it tries using REG_EQUAL notes for the final instruction.
>
> - it can parallelise two independent instructions that both read from
>   the same register or both read from memory.
>
> This last feature is useful for generating more load-pair combinations
> on AArch64.  In some cases it can also produce more store-pair
> combinations, but only for consecutive stores.  However, since the pass
> currently does this in a very greedy, peephole way, it only allows
> load/store-pair combinations if the first memory access has a higher
> alignment than the second, i.e. if we can be sure that the combined
> access is naturally aligned.  This should help it to make better
> decisions than the post-RA peephole pass in some cases while not being
> too aggressive.
>
> The pass is supposed to be linear time without debug insns.
> It only tries a constant number C of combinations per instruction
> and its bookkeeping updates are constant-time.  Once it has combined two
> instructions, it'll try up to C combinations on the result, but this can
> be counted against the instruction that was deleted by the combination
> and so effectively just doubles the constant.  (Note that C depends
> on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.)
>
> Unfortunately, debug updates via propagate_for_debug are more expensive.
> This could probably be fixed if the pass did more to track debug insns
> itself, but using propagate_for_debug matches combine's behaviour.
>
> The patch adds two instances of the new pass: one before combine and
> one after it.
> By default both are disabled, but this can be changed
> using the new 3-bit run-combine param, where:
>
> - bit 0 selects the new pre-combine pass
> - bit 1 selects the main combine pass
> - bit 2 selects the new post-combine pass
>
> The idea is that run-combine=3 can be used to see which combinations
> are missed by the new pass, while run-combine=6 (which I hope to be
> the production setting for AArch64 at -O2+) just uses the new pass
> to mop up cases that normal combine misses.  Maybe in some distant
> future, the pass will be good enough for run-combine=[14] to be a
> realistic option.
>
> I ended up having to add yet another validate_simplify_* routine,
> this time to do the equivalent of:
>
>   newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
>   validate_change (insn, loc, newx, 1);
>
> but in a more memory-efficient way.  validate_replace_rtx isn't suitable
> because it deliberately only tries simplifications in limited cases:
>
>   /* Do changes needed to keep rtx consistent.  Don't do any other
>      simplifications, as it is not our job.  */
>
> And validate_simplify_insn isn't useful for this case because it works
> on patterns that have already had changes made to them and expects
> those patterns to be valid rtxes.  simplify-replace operations instead
> need to simplify as they go, when the original modes are still to hand.
>
> As far as compile-time goes, I tried compiling optabs.ii at -O2
> with an --enable-checking=release compiler:
>
>   run-combine=2 (normal combine):  100.0% (baseline)
>   run-combine=4 (new pass only):    98.0%
>   run-combine=6 (both passes):     100.3%
>
> where the results are easily outside the noise.  So the pass on
> its own is quicker than combine, but that's not a fair comparison
> when it doesn't do everything combine does.  Running both passes
> only has a slight overhead.
>
> To get a feel for the effect on multiple targets, I did my usual
> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> and g++.dg, this time comparing run-combine=2 and run-combine=6
> using -O2 -ftree-vectorize:
>
> Target                 Tests    Delta    Best   Worst  Median
> ======                 =====    =====    ====   =====  ======
> aarch64-linux-gnu       3974   -39393   -2275      90      -2
> aarch64_be-linux-gnu    3389   -36683   -2275     165      -2
> alpha-linux-gnu         4154   -62860   -2132     335      -2
> amdgcn-amdhsa           4818     9079   -7987   51850      -2
> arc-elf                 2868   -63710  -18998     286      -1
> arm-linux-gnueabi       4053   -80404  -10019     605      -2
> arm-linux-gnueabihf     4053   -80404  -10019     605      -2
> avr-elf                 3620    38513   -2386   23364       2
> bfin-elf                2691   -32973   -1483    1127      -2
> bpf-elf                 5581   -78105  -11064     113      -3
> c6x-elf                 3915   -31710   -2441    1560      -2
> cr16-elf                6030   192102   -1757   60009      12
> cris-elf                2217   -30794   -1716     294      -2
> csky-elf                2003   -24989   -9999    1468      -2
> epiphany-elf            3345   -19416   -1803    4594      -2
> fr30-elf                3562   -15077   -1921    2334      -1
> frv-linux-gnu           2423   -16589   -1736     999      -1
> ft32-elf                2246   -46337  -15988     433      -2
> h8300-elf               2581   -33553   -1403     168      -2
> hppa64-hp-hpux11.23     3926  -120876  -50134    1056      -2
> i686-apple-darwin       3562   -46851   -1764     310      -2
> i686-pc-linux-gnu       2902    -3639   -4809    6848      -2
> ia64-linux-gnu          2900  -158870  -14006     428      -7
> iq2000-elf              2929   -54690   -2904    2576      -3
> lm32-elf                5265   162519   -1918    8004       5
> m32r-elf                1861   -25296   -2713    1004      -2
> m68k-linux-gnu          2520  -241573  -21879     200      -3
> mcore-elf               2378   -28532   -1810    1635      -2
> microblaze-elf          2782  -137363   -9516    1986      -2
> mipsel-linux-gnu        2443   -38422   -8331     458      -1
> mipsisa64-linux-gnu     2287   -60294  -12214     432      -2
> mmix                    4910  -136549  -13616     599      -2
> mn10300-elf             2944   -29151   -2488     132      -1
> moxie-rtems             1935   -12364   -1002     125      -1
> msp430-elf              2379   -37007   -2163     176      -2
> nds32le-elf             2356   -27551   -2126     163      -1
> nios2-linux-gnu         1572   -44828  -23613      92      -2
> nvptx-none              1014   -17337   -1590      16      -3
> or1k-elf                2724   -92816  -14144      56      -3
> pdp11                   1897   -27296   -1370     534      -2
> powerpc-ibm-aix7.0      2909   -58829  -10026    2001      -2
> powerpc64-linux-gnu     3685   -60551  -12158    2001      -1
> powerpc64le-linux-gnu   3501   -61846  -10024     765      -2
> pru-elf                 1574   -29734  -19998    1718      -1
> riscv32-elf             2357   -22506  -10002   10175      -1
> riscv64-elf             3320   -56777  -10002     226      -2
> rl78-elf                2113  -232328  -18607    4065      -3
> rx-elf                  2800   -38515    -896     491      -2
> s390-linux-gnu          3582   -75626  -12098    3999      -2
> s390x-linux-gnu         3761   -73473  -13748    3999      -2
> sh-linux-gnu            2350   -26401   -1003     522      -2
> sparc-linux-gnu         3279   -49518   -2175    2223      -2
> sparc64-linux-gnu       3849  -123084  -30200    2141      -2
> tilepro-linux-gnu       2737   -35562   -3458    2848      -2
> v850-elf                9002  -169126  -49996      76      -4
> vax-netbsdelf           3325   -57734  -10000    1989      -2
> visium-elf              1860   -17006   -1006    1066      -2
> x86_64-darwin           3278   -48933   -9999    1408      -2
> x86_64-linux-gnu        3008   -43887   -9999    3248      -2
> xstormy16-elf           2497   -26569   -2051      89      -2
> xtensa-elf              2161   -31231   -6910     138      -2
>
> So running both passes does seem to have a significant benefit
> on most targets, but there are some nasty-looking outliers.
> The usual caveat applies: number of lines is a very poor measurement,
> it's just to get a feel.
>
> Bootstrapped & regression-tested on aarch64-linux-gnu and
> x86_64-linux-gnu with both run-combine=3 as the default (so that the new
> pass runs first) and with run-combine=6 as the default (so that the new
> pass runs second).  There were no new execution failures.  A couple of
> guality.exp tests that already failed for most options started failing
> for a couple more.  Enabling the pass fixes the XFAILs in:
>
>   gcc.target/aarch64/sve/acle/general/ptrue_pat_[234].c
>
> Inevitably there was some scan-assembler fallout for other tests.
> E.g. in gcc.target/aarch64/vmov_n_1.c:
>
>   #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory")
>   ...
>   INHIB_OPTIMIZATION;						\
>   (a) = TEST (test, data_len);					\
>   INHIB_OPTIMIZATION;						\
>   (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a));	\
>
> is no longer effective for preventing move (a) from being merged
> into (b), because the pass can merge at the point of (a).
> I think this is a valid thing to do -- the asm semantics are still
> satisfied, and asm volatile ("" : : : "memory") never acted as a
> register barrier.  But perhaps we should deal with this as a special
> case?
Not really.  I think the testcase should be changed to use
INHIB_OPT_VAR(a) instead, where INHIB_OPT_VAR is:

#define INHIB_OPT_VAR(a) asm("":"+X"(a));

since the testcase is obviously not doing the correct testing in the
first place.  Even then, this testcase is huge and really should be
broken up into different testcases.

Thanks,
Andrew

>
> Richard
>
>
> 2019-11-17  Richard Sandiford  <richard.sandif...@arm.com>
>
> gcc/
>         * Makefile.in (OBJS): Add combine2.o.
>         * params.opt (--param=run-combine): New option.
>         * doc/invoke.texi: Document it.
>         * tree-pass.h (make_pass_combine2_before): Declare.
>         (make_pass_combine2_after): Likewise.
>         * passes.def: Add them.
>         * timevar.def (TV_COMBINE2): New timevar.
>         * cfgrtl.h (update_cfg_for_uncondjump): Declare.
>         * combine.c (update_cfg_for_uncondjump): Move to...
>         * cfgrtl.c (update_cfg_for_uncondjump): ...here.
>         * simplify-rtx.c (simplify_truncation): Handle comparisons.
>         * recog.h (validate_simplify_replace_rtx): Declare.
>         * recog.c (validate_simplify_replace_rtx_1): New function.
>         (validate_simplify_replace_rtx_uses): Likewise.
>         (validate_simplify_replace_rtx): Likewise.
>         * combine2.c: New file.
>
> Index: gcc/Makefile.in
> ===================================================================
> --- gcc/Makefile.in	2019-11-14 14:34:27.599783740 +0000
> +++ gcc/Makefile.in	2019-11-17 23:15:31.188500613 +0000
> @@ -1261,6 +1261,7 @@ OBJS = \
>  	cgraphunit.o \
>  	cgraphclones.o \
>  	combine.o \
> +	combine2.o \
>  	combine-stack-adj.o \
>  	compare-elim.o \
>  	context.o \
> Index: gcc/params.opt
> ===================================================================
> --- gcc/params.opt	2019-11-14 14:34:26.339792215 +0000
> +++ gcc/params.opt	2019-11-17 23:15:31.200500531 +0000
> @@ -768,6 +768,10 @@ Use internal function id in profile look
>  Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param
>  Maximum depth of a loop nest to fully value-number optimistically.
>
> +-param=run-combine=
> +Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param
> +Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass.
> +
>  -param=sccvn-max-alias-queries-per-access=
>  Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param
>  Maximum number of disambiguations to perform per memory access.
> Index: gcc/doc/invoke.texi
> ===================================================================
> --- gcc/doc/invoke.texi	2019-11-16 10:43:45.597105823 +0000
> +++ gcc/doc/invoke.texi	2019-11-17 23:15:31.200500531 +0000
> @@ -11807,6 +11807,11 @@ in combiner for a pseudo register as las
>  @item max-combine-insns
>  The maximum number of instructions the RTL combiner tries to combine.
>
> +@item run-combine
> +Choose which of the 3 available combine passes to run: bit 1 for the main
> +combine pass, bit 0 for an earlier variant of the combine pass, and bit 2
> +for a later variant of the combine pass.
> +
>  @item integer-share-limit
>  Small integer constants can use a shared data structure, reducing the
>  compiler's memory usage and increasing its speed.
This sets the maximum > Index: gcc/tree-pass.h > =================================================================== > --- gcc/tree-pass.h 2019-10-29 08:29:03.096444049 +0000 > +++ gcc/tree-pass.h 2019-11-17 23:15:31.204500501 +0000 > @@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i > extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt); > +extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt); > +extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt); > Index: gcc/passes.def > =================================================================== > --- gcc/passes.def 2019-10-29 08:29:03.224443133 +0000 > +++ gcc/passes.def 2019-11-17 23:15:31.200500531 +0000 > @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3. 
> NEXT_PASS (pass_inc_dec); > NEXT_PASS (pass_initialize_regs); > NEXT_PASS (pass_ud_rtl_dce); > + NEXT_PASS (pass_combine2_before); > NEXT_PASS (pass_combine); > + NEXT_PASS (pass_combine2_after); > NEXT_PASS (pass_if_after_combine); > NEXT_PASS (pass_jump_after_combine); > NEXT_PASS (pass_partition_blocks); > Index: gcc/timevar.def > =================================================================== > --- gcc/timevar.def 2019-10-11 15:43:53.403498517 +0100 > +++ gcc/timevar.def 2019-11-17 23:15:31.204500501 +0000 > @@ -251,6 +251,7 @@ DEFTIMEVAR (TV_AUTO_INC_DEC , " > DEFTIMEVAR (TV_CSE2 , "CSE 2") > DEFTIMEVAR (TV_BRANCH_PROB , "branch prediction") > DEFTIMEVAR (TV_COMBINE , "combiner") > +DEFTIMEVAR (TV_COMBINE2 , "second combiner") > DEFTIMEVAR (TV_IFCVT , "if-conversion") > DEFTIMEVAR (TV_MODE_SWITCH , "mode switching") > DEFTIMEVAR (TV_SMS , "sms modulo scheduling") > Index: gcc/cfgrtl.h > =================================================================== > --- gcc/cfgrtl.h 2019-03-08 18:15:39.320730391 +0000 > +++ gcc/cfgrtl.h 2019-11-17 23:15:31.192500584 +0000 > @@ -47,6 +47,7 @@ extern void fixup_partitions (void); > extern bool purge_dead_edges (basic_block); > extern bool purge_all_dead_edges (void); > extern bool fixup_abnormal_edges (void); > +extern void update_cfg_for_uncondjump (rtx_insn *); > extern rtx_insn *unlink_insn_chain (rtx_insn *, rtx_insn *); > extern void relink_block_chain (bool); > extern rtx_insn *duplicate_insn_chain (rtx_insn *, rtx_insn *); > Index: gcc/combine.c > =================================================================== > --- gcc/combine.c 2019-11-13 08:42:45.537368745 +0000 > +++ gcc/combine.c 2019-11-17 23:15:31.192500584 +0000 > @@ -2530,42 +2530,6 @@ reg_subword_p (rtx x, rtx reg) > && GET_MODE_CLASS (GET_MODE (x)) == MODE_INT; > } > > -/* Delete the unconditional jump INSN and adjust the CFG correspondingly. 
> - Note that the INSN should be deleted *after* removing dead edges, so > - that the kept edge is the fallthrough edge for a (set (pc) (pc)) > - but not for a (set (pc) (label_ref FOO)). */ > - > -static void > -update_cfg_for_uncondjump (rtx_insn *insn) > -{ > - basic_block bb = BLOCK_FOR_INSN (insn); > - gcc_assert (BB_END (bb) == insn); > - > - purge_dead_edges (bb); > - > - delete_insn (insn); > - if (EDGE_COUNT (bb->succs) == 1) > - { > - rtx_insn *insn; > - > - single_succ_edge (bb)->flags |= EDGE_FALLTHRU; > - > - /* Remove barriers from the footer if there are any. */ > - for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) > - if (BARRIER_P (insn)) > - { > - if (PREV_INSN (insn)) > - SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); > - else > - BB_FOOTER (bb) = NEXT_INSN (insn); > - if (NEXT_INSN (insn)) > - SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); > - } > - else if (LABEL_P (insn)) > - break; > - } > -} > - > /* Return whether PAT is a PARALLEL of exactly N register SETs followed > by an arbitrary number of CLOBBERs. */ > static bool > @@ -15096,7 +15060,10 @@ const pass_data pass_data_combine = > {} > > /* opt_pass methods: */ > - virtual bool gate (function *) { return (optimize > 0); } > + virtual bool gate (function *) > + { > + return optimize > 0 && (param_run_combine & 2) != 0; > + } > virtual unsigned int execute (function *) > { > return rest_of_handle_combine (); > Index: gcc/cfgrtl.c > =================================================================== > --- gcc/cfgrtl.c 2019-10-17 14:22:55.523309009 +0100 > +++ gcc/cfgrtl.c 2019-11-17 23:15:31.188500613 +0000 > @@ -3409,6 +3409,42 @@ fixup_abnormal_edges (void) > return inserted; > } > > +/* Delete the unconditional jump INSN and adjust the CFG correspondingly. > + Note that the INSN should be deleted *after* removing dead edges, so > + that the kept edge is the fallthrough edge for a (set (pc) (pc)) > + but not for a (set (pc) (label_ref FOO)). 
*/ > + > +void > +update_cfg_for_uncondjump (rtx_insn *insn) > +{ > + basic_block bb = BLOCK_FOR_INSN (insn); > + gcc_assert (BB_END (bb) == insn); > + > + purge_dead_edges (bb); > + > + delete_insn (insn); > + if (EDGE_COUNT (bb->succs) == 1) > + { > + rtx_insn *insn; > + > + single_succ_edge (bb)->flags |= EDGE_FALLTHRU; > + > + /* Remove barriers from the footer if there are any. */ > + for (insn = BB_FOOTER (bb); insn; insn = NEXT_INSN (insn)) > + if (BARRIER_P (insn)) > + { > + if (PREV_INSN (insn)) > + SET_NEXT_INSN (PREV_INSN (insn)) = NEXT_INSN (insn); > + else > + BB_FOOTER (bb) = NEXT_INSN (insn); > + if (NEXT_INSN (insn)) > + SET_PREV_INSN (NEXT_INSN (insn)) = PREV_INSN (insn); > + } > + else if (LABEL_P (insn)) > + break; > + } > +} > + > /* Cut the insns from FIRST to LAST out of the insns stream. */ > > rtx_insn * > Index: gcc/simplify-rtx.c > =================================================================== > --- gcc/simplify-rtx.c 2019-11-16 15:33:36.642840131 +0000 > +++ gcc/simplify-rtx.c 2019-11-17 23:15:31.204500501 +0000 > @@ -851,6 +851,12 @@ simplify_truncation (machine_mode mode, > && trunc_int_for_mode (INTVAL (XEXP (op, 1)), mode) == -1) > return constm1_rtx; > > + /* (truncate:A (cmp X Y)) is (cmp:A X Y): we can compute the result > + in a narrower mode if useful. 
*/ > + if (COMPARISON_P (op)) > + return simplify_gen_relational (GET_CODE (op), mode, VOIDmode, > + XEXP (op, 0), XEXP (op, 1)); > + > return NULL_RTX; > } > > Index: gcc/recog.h > =================================================================== > --- gcc/recog.h 2019-09-09 18:58:28.860430363 +0100 > +++ gcc/recog.h 2019-11-17 23:15:31.204500501 +0000 > @@ -111,6 +111,7 @@ extern int validate_replace_rtx_part_nos > extern void validate_replace_rtx_group (rtx, rtx, rtx_insn *); > extern void validate_replace_src_group (rtx, rtx, rtx_insn *); > extern bool validate_simplify_insn (rtx_insn *insn); > +extern bool validate_simplify_replace_rtx (rtx_insn *, rtx *, rtx, rtx); > extern int num_changes_pending (void); > extern int next_insn_tests_no_inequality (rtx_insn *); > extern bool reg_fits_class_p (const_rtx, reg_class_t, int, machine_mode); > Index: gcc/recog.c > =================================================================== > --- gcc/recog.c 2019-10-01 09:55:35.150088599 +0100 > +++ gcc/recog.c 2019-11-17 23:15:31.204500501 +0000 > @@ -922,6 +922,226 @@ validate_simplify_insn (rtx_insn *insn) > } > return ((num_changes_pending () > 0) && (apply_change_group () > 0)); > } > + > +/* A subroutine of validate_simplify_replace_rtx. Apply the replacement > + described by R to LOC. Return true on success; leave the caller > + to clean up on failure. */ > + > +static bool > +validate_simplify_replace_rtx_1 (validate_replace_src_data &r, rtx *loc) > +{ > + rtx x = *loc; > + enum rtx_code code = GET_CODE (x); > + machine_mode mode = GET_MODE (x); > + > + if (rtx_equal_p (x, r.from)) > + { > + validate_unshare_change (r.insn, loc, r.to, 1); > + return true; > + } > + > + /* Recursively apply the substitution and see if we can simplify > + the result. This specifically shouldn't use simplify_gen_*, > + since we want to avoid generating new expressions where possible. 
*/ > + int old_num_changes = num_validated_changes (); > + rtx newx = NULL_RTX; > + bool recurse_p = false; > + switch (GET_RTX_CLASS (code)) > + { > + case RTX_UNARY: > + { > + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0))) > + return false; > + > + newx = simplify_unary_operation (code, mode, XEXP (x, 0), op0_mode); > + break; > + } > + > + case RTX_BIN_ARITH: > + case RTX_COMM_ARITH: > + { > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) > + return false; > + > + newx = simplify_binary_operation (code, mode, > + XEXP (x, 0), XEXP (x, 1)); > + break; > + } > + > + case RTX_COMPARE: > + case RTX_COMM_COMPARE: > + { > + machine_mode op_mode = (GET_MODE (XEXP (x, 0)) != VOIDmode > + ? GET_MODE (XEXP (x, 0)) > + : GET_MODE (XEXP (x, 1))); > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) > + return false; > + > + newx = simplify_relational_operation (code, mode, op_mode, > + XEXP (x, 0), XEXP (x, 1)); > + break; > + } > + > + case RTX_TERNARY: > + case RTX_BITFIELD_OPS: > + { > + machine_mode op0_mode = GET_MODE (XEXP (x, 0)); > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 2))) > + return false; > + > + newx = simplify_ternary_operation (code, mode, op0_mode, > + XEXP (x, 0), XEXP (x, 1), > + XEXP (x, 2)); > + break; > + } > + > + case RTX_EXTRA: > + if (code == SUBREG) > + { > + machine_mode inner_mode = GET_MODE (SUBREG_REG (x)); > + if (!validate_simplify_replace_rtx_1 (r, &SUBREG_REG (x))) > + return false; > + > + rtx inner = SUBREG_REG (x); > + newx = simplify_subreg (mode, inner, inner_mode, SUBREG_BYTE (x)); > + /* Reject the same cases that simplify_gen_subreg would. 
*/ > + if (!newx > + && (GET_CODE (inner) == SUBREG > + || GET_CODE (inner) == CONCAT > + || GET_MODE (inner) == VOIDmode > + || !validate_subreg (mode, inner_mode, > + inner, SUBREG_BYTE (x)))) > + return false; > + break; > + } > + else > + recurse_p = true; > + break; > + > + case RTX_OBJ: > + if (code == LO_SUM) > + { > + if (!validate_simplify_replace_rtx_1 (r, &XEXP (x, 0)) > + || !validate_simplify_replace_rtx_1 (r, &XEXP (x, 1))) > + return false; > + > + /* (lo_sum (high x) y) -> y where x and y have the same base. */ > + rtx op0 = XEXP (x, 0); > + rtx op1 = XEXP (x, 1); > + if (GET_CODE (op0) == HIGH) > + { > + rtx base0, base1, offset0, offset1; > + split_const (XEXP (op0, 0), &base0, &offset0); > + split_const (op1, &base1, &offset1); > + if (rtx_equal_p (base0, base1)) > + newx = op1; > + } > + } > + else if (code == REG) > + { > + if (REG_P (r.from) && reg_overlap_mentioned_p (x, r.from)) > + return false; > + } > + else > + recurse_p = true; > + break; > + > + case RTX_CONST_OBJ: > + break; > + > + case RTX_AUTOINC: > + if (reg_overlap_mentioned_p (XEXP (x, 0), r.from)) > + return false; > + recurse_p = true; > + break; > + > + case RTX_MATCH: > + case RTX_INSN: > + gcc_unreachable (); > + } > + > + if (recurse_p) > + { > + const char *fmt = GET_RTX_FORMAT (code); > + for (int i = 0; fmt[i]; i++) > + switch (fmt[i]) > + { > + case 'E': > + for (int j = 0; j < XVECLEN (x, i); j++) > + if (!validate_simplify_replace_rtx_1 (r, &XVECEXP (x, i, j))) > + return false; > + break; > + > + case 'e': > + if (XEXP (x, i) > + && !validate_simplify_replace_rtx_1 (r, &XEXP (x, i))) > + return false; > + break; > + } > + } > + > + if (newx && !rtx_equal_p (x, newx)) > + { > + /* There's no longer any point unsharing the substitutions made > + for subexpressions, since we'll just copy this one instead. 
*/ > + for (int i = old_num_changes; i < num_changes; ++i) > + changes[i].unshare = false; > + validate_unshare_change (r.insn, loc, newx, 1); > + } > + > + return true; > +} > + > +/* A note_uses callback for validate_simplify_replace_rtx. > + DATA points to a validate_replace_src_data object. */ > + > +static void > +validate_simplify_replace_rtx_uses (rtx *loc, void *data) > +{ > + validate_replace_src_data &r = *(validate_replace_src_data *) data; > + if (r.insn && !validate_simplify_replace_rtx_1 (r, loc)) > + r.insn = NULL; > +} > + > +/* Try to perform the equivalent of: > + > + newx = simplify_replace_rtx (*loc, OLD_RTX, NEW_RTX); > + validate_change (INSN, LOC, newx, 1); > + > + but without generating as much garbage rtl when the resulting > + pattern doesn't match. > + > + Return true if we were able to replace all uses of OLD_RTX in *LOC > + and if the result conforms to general rtx rules (e.g. for whether > + subregs are meaningful). > + > + When returning true, add all replacements to the current validation group, > + leaving the caller to test it in the normal way. Leave both *LOC and the > + validation group unchanged on failure. */ > + > +bool > +validate_simplify_replace_rtx (rtx_insn *insn, rtx *loc, > + rtx old_rtx, rtx new_rtx) > +{ > + validate_replace_src_data r; > + r.from = old_rtx; > + r.to = new_rtx; > + r.insn = insn; > + > + unsigned int num_changes = num_validated_changes (); > + note_uses (loc, validate_simplify_replace_rtx_uses, &r); > + if (!r.insn) > + { > + cancel_changes (num_changes); > + return false; > + } > + return true; > +} > > /* Return 1 if the insn using CC0 set by INSN does not contain > any ordered tests applied to the condition codes. 
> Index: gcc/combine2.c > =================================================================== > --- /dev/null 2019-09-17 11:41:18.176664108 +0100 > +++ gcc/combine2.c 2019-11-17 23:15:31.196500559 +0000 > @@ -0,0 +1,1576 @@ > +/* Combine instructions > + Copyright (C) 2019 Free Software Foundation, Inc. > + > +This file is part of GCC. > + > +GCC is free software; you can redistribute it and/or modify it under > +the terms of the GNU General Public License as published by the Free > +Software Foundation; either version 3, or (at your option) any later > +version. > + > +GCC is distributed in the hope that it will be useful, but WITHOUT ANY > +WARRANTY; without even the implied warranty of MERCHANTABILITY or > +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License > +for more details. > + > +You should have received a copy of the GNU General Public License > +along with GCC; see the file COPYING3. If not see > +<http://www.gnu.org/licenses/>. */ > + > +#include "config.h" > +#include "system.h" > +#include "coretypes.h" > +#include "backend.h" > +#include "rtl.h" > +#include "df.h" > +#include "tree-pass.h" > +#include "memmodel.h" > +#include "emit-rtl.h" > +#include "insn-config.h" > +#include "recog.h" > +#include "print-rtl.h" > +#include "rtl-iter.h" > +#include "predict.h" > +#include "cfgcleanup.h" > +#include "cfghooks.h" > +#include "cfgrtl.h" > +#include "alias.h" > +#include "valtrack.h" > + > +/* This pass tries to combine instructions in the following ways: > + > + (1) If we have two dependent instructions: > + > + I1: (set DEST1 SRC1) > + I2: (...DEST1...) > + > + and I2 is the only user of DEST1, the pass tries to combine them into: > + > + I2: (...SRC1...) > + > + (2) If we have two dependent instructions: > + > + I1: (set DEST1 SRC1) > + I2: (...DEST1...) > + > + the pass tries to combine them into: > + > + I2: (parallel [(set DEST1 SRC1) (...SRC1...)]) > + > + or: > + > + I2: (parallel [(...SRC1...) 
(set DEST1 SRC1)]) > + > + (3) If we have two independent instructions: > + > + I1: (set DEST1 SRC1) > + I2: (set DEST2 SRC2) > + > + that read from memory or from the same register, the pass tries to > + combine them into: > + > + I2: (parallel [(set DEST1 SRC1) (set DEST2 SRC2)]) > + > + or: > + > + I2: (parallel [(set DEST2 SRC2) (set DEST1 SRC1)]) > + > + If the combined form is a valid instruction, the pass tries to find a > + place between I1 and I2 inclusive for the new instruction. If there > + are multiple valid locations, it tries to pick the best one by taking > + the effect on register pressure into account. > + > + If a combination succeeds and produces a single set, the pass tries to > + combine the new form with earlier or later instructions. > + > + The pass currently optimizes each basic block separately. It walks > + the instructions in reverse order, building up live ranges for registers > + and memory. It then uses these live ranges to look for possible > + combination opportunities and to decide where the combined instructions > + could be placed. > + > + The pass represents positions in the block using point numbers, > + with higher numbers indicating earlier instructions. The numbering > + scheme is that: > + > + - the end of the current instruction sequence has an even base point B. > + > + - instructions initially have odd-numbered points B + 1, B + 3, etc. > + with B + 1 being the final instruction in the sequence. > + > + - even points after B represent gaps between instructions where combined > + instructions could be placed. > + > + Thus even points initially represent no instructions and odd points > + initially represent single instructions. However, when picking a > + place for a combined instruction, the pass may choose somewhere > + inbetween the original two instructions, so that over time a point > + may come to represent several instructions. 
When this happens, > + the pass maintains the invariant that all instructions with the same > + point number are independent of each other and thus can be treated as > + acting in parallel (or as acting in any arbitrary sequence). > + > + TODOs: > + > + - Handle 3-instruction combinations, and possibly more. > + > + - Handle existing clobbers more efficiently. At the moment we can't > + move an instruction that clobbers R across another instruction that > + clobbers R. > + > + - Allow hard register clobbers to be added, like combine does. > + > + - Perhaps work on EBBs, or SESE regions. */ > + > +namespace { > + > +/* The number of explicit uses to record in a live range. */ > +const unsigned int NUM_RANGE_USERS = 4; > + > +/* The maximum number of instructions that we can combine at once. */ > +const unsigned int MAX_COMBINE_INSNS = 2; > + > +/* A fake cost for instructions that we haven't costed yet. */ > +const unsigned int UNKNOWN_COST = ~0U; > + > +class combine2 > +{ > +public: > + combine2 (function *); > + ~combine2 (); > + > + void execute (); > + > +private: > + struct insn_info_rec; > + > + /* Describes the live range of a register or of memory. For simplicity, > + we treat memory as a single entity. > + > + If we had a fully-accurate live range, updating it to account for a > + moved instruction would be a linear-time operation. Doing this for > + each combination would then make the pass quadratic. We therefore > + just maintain a list of NUM_RANGE_USERS use insns and use simple, > + conservatively-correct behavior for the rest. */ > + struct live_range_rec > + { > + /* Which instruction provides the dominating definition, or null if > + we don't know yet. */ > + insn_info_rec *producer; > + > + /* A selection of instructions that use the resource, in program order. > */ > + insn_info_rec *users[NUM_RANGE_USERS]; > + > + /* An inclusive range of points that covers instructions not mentioned > + in USERS. 
Both values are zero if there are no such instructions. > + > + Once we've included a use U at point P in this range, we continue > + to assume that some kind of use exists at P whatever happens to U > + afterwards. */ > + unsigned int first_extra_use; > + unsigned int last_extra_use; > + > + /* The register number this range describes, or INVALID_REGNUM > + for memory. */ > + unsigned int regno; > + > + /* Forms a linked list of ranges for the same resource, in program > + order. */ > + live_range_rec *prev_range; > + live_range_rec *next_range; > + }; > + > + /* Pass-specific information about an instruction. */ > + struct insn_info_rec > + { > + /* The instruction itself. */ > + rtx_insn *insn; > + > + /* A null-terminated list of live ranges for the things that this > + instruction defines. */ > + live_range_rec **defs; > + > + /* A null-terminated list of live ranges for the things that this > + instruction uses. */ > + live_range_rec **uses; > + > + /* The point at which the instruction appears. */ > + unsigned int point; > + > + /* The cost of the instruction, or UNKNOWN_COST if we haven't > + measured it yet. */ > + unsigned int cost; > + }; > + > + /* Describes one attempt to combine instructions. */ > + struct combination_attempt_rec > + { > + /* The instruction that we're currently trying to optimize. > + If the combination succeeds, we'll use this insn_info_rec > + to describe the new instruction. */ > + insn_info_rec *new_home; > + > + /* The instructions we're combining, in program order. */ > + insn_info_rec *sequence[MAX_COMBINE_INSNS]; > + > + /* If we're substituting SEQUENCE[0] into SEQUENCE[1], this is the > + live range that describes the substituted register. */ > + live_range_rec *def_use_range; > + > + /* The earliest and latest points at which we could insert the > + combined instruction. */ > + unsigned int earliest_point; > + unsigned int latest_point; > + > + /* The cost of the new instruction, once we have a successful match. 
*/ > + unsigned int new_cost; > + }; > + > + /* Pass-specific information about a register. */ > + struct reg_info_rec > + { > + /* The live range associated with the last reference to the register. */ > + live_range_rec *range; > + > + /* The point at which the last reference occurred. */ > + unsigned int next_ref; > + > + /* True if the register is currently live. We record this here rather > + than in a separate bitmap because (a) there's a natural hole for > + it on LP64 hosts and (b) we only refer to it when updating the > + other fields, and so recording it here should give better locality. */ > + unsigned int live_p : 1; > + }; > + > + live_range_rec *new_live_range (unsigned int, live_range_rec *); > + live_range_rec *reg_live_range (unsigned int); > + live_range_rec *mem_live_range (); > + bool add_range_use (live_range_rec *, insn_info_rec *); > + void remove_range_use (live_range_rec *, insn_info_rec *); > + bool has_single_use_p (live_range_rec *); > + bool known_last_use_p (live_range_rec *, insn_info_rec *); > + unsigned int find_earliest_point (insn_info_rec *, insn_info_rec *); > + unsigned int find_latest_point (insn_info_rec *, insn_info_rec *); > + bool start_combination (combination_attempt_rec &, insn_info_rec *, > + insn_info_rec *, live_range_rec * = NULL); > + bool verify_combination (combination_attempt_rec &); > + int estimate_reg_pressure_delta (insn_info_rec *); > + void commit_combination (combination_attempt_rec &, bool); > + bool try_parallel_sets (combination_attempt_rec &, rtx, rtx); > + bool try_parallelize_insns (combination_attempt_rec &); > + bool try_combine_def_use_1 (combination_attempt_rec &, rtx, rtx, bool); > + bool try_combine_def_use (combination_attempt_rec &, rtx, rtx); > + bool try_combine_two_uses (combination_attempt_rec &); > + bool try_combine (insn_info_rec *, rtx, unsigned int); > + bool optimize_insn (insn_info_rec *); > + void record_defs (insn_info_rec *); > + void record_reg_use (insn_info_rec *, df_ref);
> + void record_uses (insn_info_rec *); > + void process_insn (insn_info_rec *); > + void start_sequence (); > + > + /* The function we're optimizing. */ > + function *m_fn; > + > + /* The highest pseudo register number plus one. */ > + unsigned int m_num_regs; > + > + /* The current basic block. */ > + basic_block m_bb; > + > + /* True if we should optimize the current basic block for speed. */ > + bool m_optimize_for_speed_p; > + > + /* The point number to allocate to the next instruction we visit > + in the backward traversal. */ > + unsigned int m_point; > + > + /* The point number corresponding to the end of the current > + instruction sequence, i.e. the lowest point number about which > + we still have valid information. */ > + unsigned int m_end_of_sequence; > + > + /* The point number corresponding to the end of the current basic block. > + This is the same as M_END_OF_SEQUENCE when processing the last > + instruction sequence in a basic block. */ > + unsigned int m_end_of_bb; > + > + /* The memory live range, or null if we haven't yet found a memory > + reference in the current instruction sequence. */ > + live_range_rec *m_mem_range; > + > + /* Gives information about each register. We track both hard and > + pseudo registers. */ > + auto_vec<reg_info_rec> m_reg_info; > + > + /* A bitmap of registers whose entry in m_reg_info is valid. */ > + auto_sbitmap m_valid_regs; > + > + /* If nonnull, an unused 2-element PARALLEL that we can use to test > + instruction combinations. */ > + rtx m_spare_parallel; > + > + /* A bitmap of instructions that we've already tried to combine with. */ > + auto_bitmap m_tried_insns; > + > + /* A temporary bitmap used to hold register numbers. */ > + auto_bitmap m_true_deps; > + > + /* An obstack used for allocating insn_info_recs and for building > + up their lists of definitions and uses. */ > + obstack m_insn_obstack; > + > + /* An obstack used for allocating live_range_recs.
*/ > + obstack m_range_obstack; > + > + /* Start-of-object pointers for the two obstacks. */ > + char *m_insn_obstack_start; > + char *m_range_obstack_start; > + > + /* A list of instructions that we've optimized and whose new forms > + change the cfg. */ > + auto_vec<rtx_insn *> m_cfg_altering_insns; > + > + /* The INSN_UIDs of all instructions in M_CFG_ALTERING_INSNS. */ > + auto_bitmap m_cfg_altering_insn_ids; > + > + /* We can insert new instructions at point P * 2 by inserting them > + after M_POINTS[P - M_END_OF_SEQUENCE / 2]. We can insert new > + instructions at point P * 2 + 1 by inserting them before > + M_POINTS[P - M_END_OF_SEQUENCE / 2]. */ > + auto_vec<rtx_insn *, 256> m_points; > +}; > + > +combine2::combine2 (function *fn) > + : m_fn (fn), > + m_num_regs (max_reg_num ()), > + m_bb (NULL), > + m_optimize_for_speed_p (false), > + m_point (2), > + m_end_of_sequence (m_point), > + m_end_of_bb (m_point), > + m_mem_range (NULL), > + m_reg_info (m_num_regs), > + m_valid_regs (m_num_regs), > + m_spare_parallel (NULL_RTX) > +{ > + gcc_obstack_init (&m_insn_obstack); > + gcc_obstack_init (&m_range_obstack); > + m_reg_info.quick_grow (m_num_regs); > + bitmap_clear (m_valid_regs); > + m_insn_obstack_start = XOBNEWVAR (&m_insn_obstack, char, 0); > + m_range_obstack_start = XOBNEWVAR (&m_range_obstack, char, 0); > +} > + > +combine2::~combine2 () > +{ > + obstack_free (&m_insn_obstack, NULL); > + obstack_free (&m_range_obstack, NULL); > +} > + > +/* Return true if it's possible in principle to combine INSN with > + other instructions. ALLOW_ASMS_P is true if the caller can cope > + with asm statements. 
*/ > + > +static bool > +combinable_insn_p (rtx_insn *insn, bool allow_asms_p) > +{ > + rtx pattern = PATTERN (insn); > + > + if (GET_CODE (pattern) == USE || GET_CODE (pattern) == CLOBBER) > + return false; > + > + if (JUMP_P (insn) && find_reg_note (insn, REG_NON_LOCAL_GOTO, NULL_RTX)) > + return false; > + > + if (!allow_asms_p && asm_noperands (PATTERN (insn)) >= 0) > + return false; > + > + return true; > +} > + > +/* Return true if it's possible in principle to move INSN somewhere else, > + as long as all dependencies are satisfied. */ > + > +static bool > +movable_insn_p (rtx_insn *insn) > +{ > + if (JUMP_P (insn)) > + return false; > + > + if (volatile_refs_p (PATTERN (insn))) > + return false; > + > + return true; > +} > + > +/* Create and return a new live range for REGNO. NEXT is the next range > + in program order, or null if this is the first live range in the > + sequence. */ > + > +combine2::live_range_rec * > +combine2::new_live_range (unsigned int regno, live_range_rec *next) > +{ > + live_range_rec *range = XOBNEW (&m_range_obstack, live_range_rec); > + memset (range, 0, sizeof (*range)); > + > + range->regno = regno; > + range->next_range = next; > + if (next) > + next->prev_range = range; > + return range; > +} > + > +/* Return the current live range for register REGNO, creating a new > + one if necessary. */ > + > +combine2::live_range_rec * > +combine2::reg_live_range (unsigned int regno) > +{ > + /* Initialize the liveness flag, if it isn't already valid for this BB. */ > + bool first_ref_p = !bitmap_bit_p (m_valid_regs, regno); > + if (first_ref_p || m_reg_info[regno].next_ref < m_end_of_bb) > + m_reg_info[regno].live_p = bitmap_bit_p (df_get_live_out (m_bb), regno); > + > + /* See if we already have a live range associated with the current > + instruction sequence. 
*/ > + live_range_rec *range = NULL; > + if (!first_ref_p && m_reg_info[regno].next_ref >= m_end_of_sequence) > + range = m_reg_info[regno].range; > + > + /* Create a new range if this is the first reference to REGNO in the > + current instruction sequence or if the current range has been closed > + off by a definition. */ > + if (!range || range->producer) > + { > + range = new_live_range (regno, range); > + > + /* If the register is live after the current sequence, treat that > + as a fake use at the end of the sequence. */ > + if (!range->next_range && m_reg_info[regno].live_p) > + range->first_extra_use = range->last_extra_use = m_end_of_sequence; > + > + /* Record that this is now the current range for REGNO. */ > + if (first_ref_p) > + bitmap_set_bit (m_valid_regs, regno); > + m_reg_info[regno].range = range; > + m_reg_info[regno].next_ref = m_point; > + } > + return range; > +} > + > +/* Return the current live range for memory, treating memory as a single > + entity. Create a new live range if necessary. */ > + > +combine2::live_range_rec * > +combine2::mem_live_range () > +{ > + if (!m_mem_range || m_mem_range->producer) > + m_mem_range = new_live_range (INVALID_REGNUM, m_mem_range); > + return m_mem_range; > +} > + > +/* Record that instruction USER uses the resource described by RANGE. > + Return true if this is new information. */ > + > +bool > +combine2::add_range_use (live_range_rec *range, insn_info_rec *user) > +{ > + /* See if we've already recorded the instruction, or if there's a > + spare use slot we can use. */ > + unsigned int i = 0; > + for (; i < NUM_RANGE_USERS && range->users[i]; ++i) > + if (range->users[i] == user) > + return false; > + > + if (i == NUM_RANGE_USERS) > + { > + /* Since we've processed USER recently, assume that it's more > + interesting to record explicitly than the last user in the > + current list. Evict that last user and describe it in the > + overflow "extra use" range instead. 
*/ > + insn_info_rec *ousted_user = range->users[--i]; > + if (range->first_extra_use < ousted_user->point) > + range->first_extra_use = ousted_user->point; > + if (range->last_extra_use > ousted_user->point) > + range->last_extra_use = ousted_user->point; > + } > + > + /* Insert USER while keeping the list sorted. */ > + for (; i > 0 && range->users[i - 1]->point < user->point; --i) > + range->users[i] = range->users[i - 1]; > + range->users[i] = user; > + return true; > +} > + > +/* Remove USER from the uses recorded for RANGE, if we can. > + There's nothing we can do if USER was described in the > + overflow "extra use" range. */ > + > +void > +combine2::remove_range_use (live_range_rec *range, insn_info_rec *user) > +{ > + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) > + if (range->users[i] == user) > + { > + for (unsigned int j = i; j < NUM_RANGE_USERS - 1; ++j) > + range->users[j] = range->users[j + 1]; > + range->users[NUM_RANGE_USERS - 1] = NULL; > + break; > + } > +} > + > +/* Return true if RANGE has a single known user. */ > + > +bool > +combine2::has_single_use_p (live_range_rec *range) > +{ > + return range->users[0] && !range->users[1] && !range->first_extra_use; > +} > + > +/* Return true if we know that USER is the last user of RANGE. */ > + > +bool > +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user) > +{ > + if (range->last_extra_use <= user->point) > + return false; > + > + for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i) > + if (range->users[i] == user) > + return i == NUM_RANGE_USERS - 1 || !range->users[i + 1]; > + else if (range->users[i]->point == user->point) > + return false; > + > + gcc_unreachable (); > +} > + > +/* Find the earliest point that we could move I2 up in order to combine > + it with I1. Ignore any dependencies between I1 and I2; leave the > + caller to deal with those instead. 
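> + > + For example, if some instruction I0 between I1 and I2 defines a register > + that I2 uses, I2 cannot be moved above I0, and so the result is capped > + at the point just below I0's (recalling that higher point numbers > + indicate earlier positions).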
*/ > + > +unsigned int > +combine2::find_earliest_point (insn_info_rec *i2, insn_info_rec *i1) > +{ > + if (!movable_insn_p (i2->insn)) > + return i2->point; > + > + /* Start by optimistically assuming that we can move the instruction > + all the way up to I1. */ > + unsigned int point = i1->point; > + > + /* Make sure that the new position preserves all necessary true dependencies > + on earlier instructions. */ > + for (live_range_rec **use = i2->uses; *use; ++use) > + { > + live_range_rec *range = *use; > + if (range->producer > + && range->producer != i1 > + && point >= range->producer->point) > + point = range->producer->point - 1; > + } > + > + /* Make sure that the new position preserves all necessary output and > + anti dependencies on earlier instructions. */ > + for (live_range_rec **def = i2->defs; *def; ++def) > + if (live_range_rec *range = (*def)->prev_range) > + { > + if (range->producer > + && range->producer != i1 > + && point >= range->producer->point) > + point = range->producer->point - 1; > + > + for (unsigned int i = NUM_RANGE_USERS - 1; i-- > 0;) > + if (range->users[i] && range->users[i] != i1) > + { > + if (point >= range->users[i]->point) > + point = range->users[i]->point - 1; > + break; > + } > + > + if (range->last_extra_use && point >= range->last_extra_use) > + point = range->last_extra_use - 1; > + } > + > + return point; > +} > + > +/* Find the latest point that we could move I1 down in order to combine > + it with I2. Ignore any dependencies between I1 and I2; leave the > + caller to deal with those instead. */ > + > +unsigned int > +combine2::find_latest_point (insn_info_rec *i1, insn_info_rec *i2) > +{ > + if (!movable_insn_p (i1->insn)) > + return i1->point; > + > + /* Start by optimistically assuming that we can move the instruction > + all the way down to I2. */ > + unsigned int point = i2->point; > + > + /* Make sure that the new position preserves all necessary anti dependencies > + on later instructions.
*/ > + for (live_range_rec **use = i1->uses; *use; ++use) > + if (live_range_rec *range = (*use)->next_range) > + if (range->producer != i2 && point <= range->producer->point) > + point = range->producer->point + 1; > + > + /* Make sure that the new position preserves all necessary output and > + true dependencies on later instructions. */ > + for (live_range_rec **def = i1->defs; *def; ++def) > + { > + live_range_rec *range = *def; > + > + for (unsigned int i = 0; i < NUM_RANGE_USERS; ++i) > + if (range->users[i] != i2) > + { > + if (range->users[i] && point <= range->users[i]->point) > + point = range->users[i]->point + 1; > + break; > + } > + > + if (range->first_extra_use && point <= range->first_extra_use) > + point = range->first_extra_use + 1; > + > + live_range_rec *next_range = range->next_range; > + if (next_range > + && next_range->producer != i2 > + && point <= next_range->producer->point) > + point = next_range->producer->point + 1; > + } > + > + return point; > +} > + > +/* Initialize ATTEMPT for an attempt to combine instructions I1 and I2, > + where I1 is the instruction that we're currently trying to optimize. > + If DEF_USE_RANGE is nonnull, I1 defines the value described by > + DEF_USE_RANGE and I2 uses it. */ > + > +bool > +combine2::start_combination (combination_attempt_rec &attempt, > + insn_info_rec *i1, insn_info_rec *i2, > + live_range_rec *def_use_range) > +{ > + attempt.new_home = i1; > + attempt.sequence[0] = i1; > + attempt.sequence[1] = i2; > + if (attempt.sequence[0]->point < attempt.sequence[1]->point) > + std::swap (attempt.sequence[0], attempt.sequence[1]); > + attempt.def_use_range = def_use_range; > + > + /* Check that the instructions have no true dependencies other than > + DEF_USE_RANGE. 
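> + For example, if SEQUENCE[0] sets a register R that SEQUENCE[1] reads > + through some range other than DEF_USE_RANGE, combining the two patterns > + would lose that dependency, so the attempt must fail.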
*/ > + bitmap_clear (m_true_deps); > + for (live_range_rec **def = attempt.sequence[0]->defs; *def; ++def) > + if (*def != def_use_range) > + bitmap_set_bit (m_true_deps, (*def)->regno); > + for (live_range_rec **use = attempt.sequence[1]->uses; *use; ++use) > + if (*use != def_use_range && bitmap_bit_p (m_true_deps, (*use)->regno)) > + return false; > + > + /* Calculate the range of points at which the combined instruction > + could live. */ > + attempt.earliest_point = find_earliest_point (attempt.sequence[1], > + attempt.sequence[0]); > + attempt.latest_point = find_latest_point (attempt.sequence[0], > + attempt.sequence[1]); > + if (attempt.earliest_point < attempt.latest_point) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "cannot combine %d and %d: no suitable" > + " location for combined insn\n", > + INSN_UID (attempt.sequence[0]->insn), > + INSN_UID (attempt.sequence[1]->insn)); > + return false; > + } > + > + /* Make sure we have valid costs for the original instructions before > + we start changing their patterns. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + if (attempt.sequence[i]->cost == UNKNOWN_COST) > + attempt.sequence[i]->cost = insn_cost (attempt.sequence[i]->insn, > + m_optimize_for_speed_p); > + return true; > +} > + > +/* Check whether the combination attempt described by ATTEMPT matches > + an .md instruction (or matches its constraints, in the case of an > + asm statement). If so, calculate the cost of the new instruction > + and check whether it's cheap enough. 
*/ > + > +bool > +combine2::verify_combination (combination_attempt_rec &attempt) > +{ > + rtx_insn *insn = attempt.sequence[1]->insn; > + > + bool ok_p = verify_changes (0); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + if (!ok_p) > + fprintf (dump_file, "failed to match this instruction:\n"); > + else if (const char *name = get_insn_name (INSN_CODE (insn))) > + fprintf (dump_file, "successfully matched this instruction to %s:\n", > + name); > + else > + fprintf (dump_file, "successfully matched this instruction:\n"); > + print_rtl_single (dump_file, PATTERN (insn)); > + } > + if (!ok_p) > + return false; > + > + unsigned int cost1 = attempt.sequence[0]->cost; > + unsigned int cost2 = attempt.sequence[1]->cost; > + attempt.new_cost = insn_cost (insn, m_optimize_for_speed_p); > + ok_p = (attempt.new_cost <= cost1 + cost2); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "original cost = %d + %d, replacement cost = %d; %s\n", > + cost1, cost2, attempt.new_cost, > + ok_p ? "keeping replacement" : "rejecting replacement"); > + if (!ok_p) > + return false; > + > + confirm_change_group (); > + return true; > +} > + > +/* Return true if we should consider register REGNO when calculating > + register pressure estimates. */ > + > +static bool > +count_reg_pressure_p (unsigned int regno) > +{ > + if (regno == INVALID_REGNUM) > + return false; > + > + /* Unallocatable registers aren't interesting. */ > + if (HARD_REGISTER_NUM_P (regno) && fixed_regs[regno]) > + return false; > + > + return true; > +} > + > +/* Try to estimate the effect that the original form of INSN_INFO > + had on register pressure, in the form "born - dying".
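> + > + For example, (set r1 (plus r2 r3)) gives a delta of 1 - 2 = -1 if both > + r2 and r3 are known to die here, but 1 - 0 = +1 if neither register is > + known to be at its last use.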
*/ > + > +int > +combine2::estimate_reg_pressure_delta (insn_info_rec *insn_info) > +{ > + int delta = 0; > + > + for (live_range_rec **def = insn_info->defs; *def; ++def) > + if (count_reg_pressure_p ((*def)->regno)) > + delta += 1; > + > + for (live_range_rec **use = insn_info->uses; *use; ++use) > + if (count_reg_pressure_p ((*use)->regno) > + && known_last_use_p (*use, insn_info)) > + delta -= 1; > + > + return delta; > +} > + > +/* We've moved FROM_INSN's pattern to TO_INSN and are about to delete > + FROM_INSN. Copy any useful information to TO_INSN before doing that. */ > + > +static void > +transfer_insn (rtx_insn *to_insn, rtx_insn *from_insn) > +{ > + INSN_LOCATION (to_insn) = INSN_LOCATION (from_insn); > + INSN_CODE (to_insn) = INSN_CODE (from_insn); > + REG_NOTES (to_insn) = REG_NOTES (from_insn); > +} > + > +/* The combination attempt in ATTEMPT has succeeded and is currently > + part of an open validate_change group. Commit to making the change > + and decide where the new instruction should go. > + > + KEPT_DEF_P is true if the new instruction continues to perform > + the definition described by ATTEMPT.def_use_range. */ > + > +void > +combine2::commit_combination (combination_attempt_rec &attempt, > + bool kept_def_p) > +{ > + insn_info_rec *new_home = attempt.new_home; > + rtx_insn *old_insn = attempt.sequence[0]->insn; > + rtx_insn *new_insn = attempt.sequence[1]->insn; > + > + /* Remove any notes that are no longer relevant. 
*/ > + bool single_set_p = single_set (new_insn); > + for (rtx *note_ptr = &REG_NOTES (new_insn); *note_ptr; ) > + { > + rtx note = *note_ptr; > + bool keep_p = true; > + switch (REG_NOTE_KIND (note)) > + { > + case REG_EQUAL: > + case REG_EQUIV: > + case REG_NOALIAS: > + keep_p = single_set_p; > + break; > + > + case REG_UNUSED: > + keep_p = false; > + break; > + > + default: > + break; > + } > + if (keep_p) > + note_ptr = &XEXP (*note_ptr, 1); > + else > + { > + *note_ptr = XEXP (*note_ptr, 1); > + free_EXPR_LIST_node (note); > + } > + } > + > + /* Complete the open validate_change group. */ > + confirm_change_group (); > + > + /* Decide where the new instruction should go. */ > + unsigned int new_point = attempt.latest_point; > + if (new_point != attempt.earliest_point > + && prev_real_insn (new_insn) != old_insn) > + { > + /* Prefer the earliest point if the combined instruction reduces > + register pressure and the latest point if it increases register > + pressure. > + > + The choice isn't obvious in the event of a tie, but picking > + the earliest point should reduce the number of times that > + we need to invalidate debug insns. */ > + int delta1 = estimate_reg_pressure_delta (attempt.sequence[0]); > + int delta2 = estimate_reg_pressure_delta (attempt.sequence[1]); > + bool move_up_p = (delta1 + delta2 <= 0); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + "register pressure delta = %d + %d; using %s position\n", > + delta1, delta2, move_up_p ? "earliest" : "latest"); > + if (move_up_p) > + new_point = attempt.earliest_point; > + } > + > + /* Translate inserting at NEW_POINT into inserting before or after > + a particular insn. */ > + rtx_insn *anchor = NULL; > + bool before_p = (new_point & 1); > + if (new_point != attempt.sequence[1]->point > + && new_point != attempt.sequence[0]->point) > + { > + anchor = m_points[(new_point - m_end_of_sequence) / 2]; > + rtx_insn *other_side = (before_p > + ?
prev_real_insn (anchor) > + : next_real_insn (anchor)); > + /* Inserting next to an insn X and then deleting X is just a > + roundabout way of using X as the insertion point. */ > + if (anchor == new_insn || other_side == new_insn) > + new_point = attempt.sequence[1]->point; > + else if (anchor == old_insn || other_side == old_insn) > + new_point = attempt.sequence[0]->point; > + } > + > + /* Actually perform the move. */ > + if (new_point == attempt.sequence[1]->point) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "using insn %d to hold the combined pattern\n", > + INSN_UID (new_insn)); > + set_insn_deleted (old_insn); > + } > + else if (new_point == attempt.sequence[0]->point) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "using insn %d to hold the combined pattern\n", > + INSN_UID (old_insn)); > + PATTERN (old_insn) = PATTERN (new_insn); > + transfer_insn (old_insn, new_insn); > + std::swap (old_insn, new_insn); > + set_insn_deleted (old_insn); > + } > + else > + { > + /* We need to insert a new instruction. We can't simply move > + NEW_INSN because it acts as an insertion anchor in m_points. */ > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "inserting combined insn %s insn %d\n", > + before_p ? "before" : "after", INSN_UID (anchor)); > + > + rtx_insn *added_insn = (before_p > + ? emit_insn_before (PATTERN (new_insn), anchor) > + : emit_insn_after (PATTERN (new_insn), anchor)); > + transfer_insn (added_insn, new_insn); > + set_insn_deleted (old_insn); > + set_insn_deleted (new_insn); > + new_insn = added_insn; > + } > + df_insn_rescan (new_insn); > + > + /* Unlink the old uses. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) > + remove_range_use (*use, attempt.sequence[i]); > + > + /* Work out which registers the new pattern uses. 
*/ > + bitmap_clear (m_true_deps); > + df_ref use; > + FOR_EACH_INSN_USE (use, new_insn) > + { > + rtx reg = DF_REF_REAL_REG (use); > + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); > + } > + FOR_EACH_INSN_EQ_USE (use, new_insn) > + { > + rtx reg = DF_REF_REAL_REG (use); > + bitmap_set_range (m_true_deps, REGNO (reg), REG_NREGS (reg)); > + } > + > + /* Describe the combined instruction in NEW_HOME. */ > + new_home->insn = new_insn; > + new_home->point = new_point; > + new_home->cost = attempt.new_cost; > + > + /* Build up a list of definitions for the combined instructions > + and update all the ranges accordingly. It shouldn't matter > + which order we do this in. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + for (live_range_rec **def = attempt.sequence[i]->defs; *def; ++def) > + if (kept_def_p || *def != attempt.def_use_range) > + { > + obstack_ptr_grow (&m_insn_obstack, *def); > + (*def)->producer = new_home; > + } > + obstack_ptr_grow (&m_insn_obstack, NULL); > + new_home->defs = (live_range_rec **) obstack_finish (&m_insn_obstack); > + > + /* Build up a list of uses for the combined instructions and update > + all the ranges accordingly. Again, it shouldn't matter which > + order we do this in. */ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + for (live_range_rec **use = attempt.sequence[i]->uses; *use; ++use) > + if (*use != attempt.def_use_range > + && add_range_use (*use, new_home)) > + obstack_ptr_grow (&m_insn_obstack, *use); > + obstack_ptr_grow (&m_insn_obstack, NULL); > + new_home->uses = (live_range_rec **) obstack_finish (&m_insn_obstack); > + > + /* There shouldn't be any remaining references to other instructions > + in the combination. Invalidate their contents to make lingering > + references a noisy failure. 
*/ > + for (unsigned int i = 0; i < MAX_COMBINE_INSNS; ++i) > + if (attempt.sequence[i] != new_home) > + { > + attempt.sequence[i]->insn = NULL; > + attempt.sequence[i]->point = ~0U; > + } > + > + /* Unlink the def-use range. */ > + if (!kept_def_p && attempt.def_use_range) > + { > + live_range_rec *range = attempt.def_use_range; > + if (range->prev_range) > + range->prev_range->next_range = range->next_range; > + else > + m_reg_info[range->regno].range = range->next_range; > + if (range->next_range) > + range->next_range->prev_range = range->prev_range; > + } > + > + /* Record instructions whose new form alters the cfg. */ > + rtx pattern = PATTERN (new_insn); > + if ((returnjump_p (new_insn) > + || any_uncondjump_p (new_insn) > + || (GET_CODE (pattern) == TRAP_IF && XEXP (pattern, 0) == const1_rtx)) > + && bitmap_set_bit (m_cfg_altering_insn_ids, INSN_UID (new_insn))) > + m_cfg_altering_insns.safe_push (new_insn); > +} > + > +/* Return true if X1 and X2 are memories and if X1 does not have > + a higher alignment than X2. */ > + > +static bool > +dubious_mem_pair_p (rtx x1, rtx x2) > +{ > + return MEM_P (x1) && MEM_P (x2) && MEM_ALIGN (x1) <= MEM_ALIGN (x2); > +} > + > +/* Try to implement ATTEMPT using (parallel [SET1 SET2]). */ > + > +bool > +combine2::try_parallel_sets (combination_attempt_rec &attempt, > + rtx set1, rtx set2) > +{ > + rtx_insn *insn = attempt.sequence[1]->insn; > + > + /* Combining two loads or two stores can be useful on targets that > + allow them to be treated as a single access. However, we use a > + very peephole approach to picking the pairs, so we need to be > + relatively confident that we're making a good choice. > + > + For now just aim for cases in which the memory references are > + consecutive and the first reference has a higher alignment.
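> + For example, two 4-byte accesses in which the first is 8-byte-aligned > + form a naturally-aligned 8-byte pair.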
> + We can leave the target to test the consecutive part; whatever test > + we added here might be different from the target's, and in any case > + it's fine if the target accepts other well-aligned cases too. */ > + if (dubious_mem_pair_p (SET_DEST (set1), SET_DEST (set2)) > + || dubious_mem_pair_p (SET_SRC (set1), SET_SRC (set2))) > + return false; > + > + /* Cache the PARALLEL rtx between attempts so that we don't generate > + too much garbage rtl. */ > + if (!m_spare_parallel) > + { > + rtvec vec = gen_rtvec (2, set1, set2); > + m_spare_parallel = gen_rtx_PARALLEL (VOIDmode, vec); > + } > + else > + { > + XVECEXP (m_spare_parallel, 0, 0) = set1; > + XVECEXP (m_spare_parallel, 0, 1) = set2; > + } > + > + unsigned int num_changes = num_validated_changes (); > + validate_change (insn, &PATTERN (insn), m_spare_parallel, true); > + if (verify_combination (attempt)) > + { > + m_spare_parallel = NULL_RTX; > + return true; > + } > + cancel_changes (num_changes); > + return false; > +} > + > +/* Try to parallelize the two instructions in ATTEMPT. */ > + > +bool > +combine2::try_parallelize_insns (combination_attempt_rec &attempt) > +{ > + rtx_insn *i1_insn = attempt.sequence[0]->insn; > + rtx_insn *i2_insn = attempt.sequence[1]->insn; > + > + /* Can't parallelize asm statements. */ > + if (asm_noperands (PATTERN (i1_insn)) >= 0 > + || asm_noperands (PATTERN (i2_insn)) >= 0) > + return false; > + > + /* For now, just handle the case in which both instructions are > + single sets. We could handle more than 2 sets as well, but few > + targets support that anyway. */ > + rtx set1 = single_set (i1_insn); > + if (!set1) > + return false; > + rtx set2 = single_set (i2_insn); > + if (!set2) > + return false; > + > + /* Make sure that we have structural proof that the destinations > + are independent. Things like alias analysis rely on semantic > + information and assume no undefined behavior, which is rarely a > + good enough guarantee to allow a useful instruction combination. 
*/ > + rtx dest1 = SET_DEST (set1); > + rtx dest2 = SET_DEST (set2); > + if (MEM_P (dest1) > + ? MEM_P (dest2) && !nonoverlapping_memrefs_p (dest1, dest2, false) > + : !MEM_P (dest2) && reg_overlap_mentioned_p (dest1, dest2)) > + return false; > + > + /* Try the sets in both orders. */ > + if (try_parallel_sets (attempt, set1, set2) > + || try_parallel_sets (attempt, set2, set1)) > + { > + commit_combination (attempt, true); > + if (MAY_HAVE_DEBUG_BIND_INSNS > + && attempt.new_home->insn != i1_insn) > + propagate_for_debug (i1_insn, attempt.new_home->insn, > +