On Mon, Nov 14, 2016 at 02:25:28PM +0000, Kyrill Tkachov wrote:
>
> On 11/11/16 15:31, Kyrill Tkachov wrote:
> >
> > On 11/11/16 10:17, Kyrill Tkachov wrote:
> >>
> >> On 10/11/16 23:39, Segher Boessenkool wrote:
> >>> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
> >>>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
> >>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but
> >>>>> there were some interesting swings.
> >>>>> 458.sjeng    +1.45%
> >>>>> 471.omnetpp  +2.19%
> >>>>> 445.gobmk    -2.01%
> >>>>>
> >>>>> On SPECFP:
> >>>>> 453.povray   +7.00%
> >>>>
> >>>> Wow, this looks really good. Thank you for implementing this. If I
> >>>> get some time I am going to try it out on processors other than the
> >>>> A72, but I doubt I have time any time soon.
> >>>
> >>> I'd love to hear what causes the slowdown for gobmk as well, btw.
> >>
> >> I haven't yet gotten a direct answer for that (through performance
> >> analysis tools) but I have noticed that load/store pairs are not
> >> generated as aggressively as I hoped. They are being merged by the
> >> sched fusion pass and peepholes (which run after this) but it still
> >> misses cases. I've hacked the SWS hooks to generate pairs explicitly
> >> and that increases the number of pairs and helps code size to boot.
> >> It complicates the logic of the hooks a bit, but not too much.
> >>
> >> I'll make those changes and re-benchmark; hopefully that will help
> >> performance.
> >
> > And here's a version that explicitly emits pairs. I've looked at
> > assembly codegen on SPEC2006 and it generates quite a few more LDP/STP
> > pairs than the original version. I kicked off benchmarks over the
> > weekend to see the effect. Andrew, if you want to try it out (more
> > benchmarking and testing always welcome) this is the one to try.
>
> And I discovered over the weekend that gamess and wrf have validation
> errors. This version runs correctly.
> SPECINT results were fine though, and there is even a small overall
> gain due to sjeng and omnetpp. However, gobmk still has the regression.
> I'll rerun SPECFP with this patch (it's really just a small bugfix over
> the previous version) and get on with analysing gobmk.
I have some comments inline, most of which are about hardcoding the
maximum register number, but at a high level this looks good to me.

Thanks,
James

> 2016-11-11  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>     * config/aarch64/aarch64.h (machine_function): Add
>     reg_is_wrapped_separately field.
>     * config/aarch64/aarch64.c (emit_set_insn): Change return type to
>     rtx_insn *.
>     (aarch64_save_callee_saves): Don't save registers that are wrapped
>     separately.
>     (aarch64_restore_callee_saves): Don't restore registers that are
>     wrapped separately.
>     (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
>     aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
>     (aarch64_get_separate_components): New function.
>     (aarch64_get_next_set_bit): Likewise.
>     (aarch64_components_for_bb): Likewise.
>     (aarch64_disqualify_components): Likewise.
>     (aarch64_emit_prologue_components): Likewise.
>     (aarch64_emit_epilogue_components): Likewise.
>     (aarch64_set_handled_components): Likewise.
>     (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
>     TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
>     TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
>     TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
>
> commit 06ac3c30d8aa38781ee9019e60a5fcf727b85231
> Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Date:   Tue Oct 11 09:25:54 2016 +0100
>
>     [AArch64] Separate shrink wrapping hooks implementation
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 325e725..2d33ef6 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,
>
>  /* Emit an insn that's a simple single-set.  Both the operands must be
>     known to be valid.
>     */
> -inline static rtx
> +inline static rtx_insn *
>  emit_set_insn (rtx x, rtx y)
>  {
>    return emit_insn (gen_rtx_SET (x, y));
> @@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
>  	  || regno == cfun->machine->frame.wb_candidate2))
>  	continue;
>
> +      if (cfun->machine->reg_is_wrapped_separately[regno])
> +	continue;
> +
>        reg = gen_rtx_REG (mode, regno);
>        offset = start_offset + cfun->machine->frame.reg_offset[regno];
>        mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
> @@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
>        regno2 = aarch64_next_callee_save (regno + 1, limit);
>
>        if (regno2 <= limit
> +	  && !cfun->machine->reg_is_wrapped_separately[regno2]
>  	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
>  	      == cfun->machine->frame.reg_offset[regno2]))
>
> @@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode,
>        regno <= limit;
>        regno = aarch64_next_callee_save (regno + 1, limit))
>      {
> +      if (cfun->machine->reg_is_wrapped_separately[regno])
> +	continue;
> +
>        rtx reg, mem;
>
>        if (skip_wb
> @@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode,
>        regno2 = aarch64_next_callee_save (regno + 1, limit);
>
>        if (regno2 <= limit
> +	  && !cfun->machine->reg_is_wrapped_separately[regno2]
>  	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
>  	      == cfun->machine->frame.reg_offset[regno2]))
>  	{
> @@ -3224,6 +3232,272 @@ aarch64_restore_callee_saves (machine_mode mode,
>      }
>  }
>
> +static inline bool
> +offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
> +			       HOST_WIDE_INT offset)
> +{
> +  return offset >= -256 && offset < 256;

A follow-up changing this to use IN_RANGE would seem correct.
> +}
> +
> +static inline bool
> +offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
> +{
> +  return (offset >= 0
> +	  && offset < 4096 * GET_MODE_SIZE (mode)
> +	  && offset % GET_MODE_SIZE (mode) == 0);
> +}
> +
> +bool
> +aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
> +{
> +  return (offset >= -64 * GET_MODE_SIZE (mode)
> +	  && offset < 64 * GET_MODE_SIZE (mode)
> +	  && offset % GET_MODE_SIZE (mode) == 0);

And certainly here, IN_RANGE would be neater.

> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
> +
> +static sbitmap
> +aarch64_get_separate_components (void)
> +{
> +  aarch64_layout_frame ();
> +
> +  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
> +  bitmap_clear (components);
> +
> +  /* The registers we need saved to the frame.  */
> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> +    if (aarch64_register_saved_on_entry (regno))
> +      {
> +	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
> +	if (!frame_pointer_needed)
> +	  offset += cfun->machine->frame.frame_size
> +		    - cfun->machine->frame.hard_fp_offset;
> +	/* Check that we can access the stack slot of the register with one
> +	   direct load with no adjustments needed.  */
> +	if (offset_12bit_unsigned_scaled_p (DImode, offset))
> +	  bitmap_set_bit (components, regno);
> +      }
> +
> +  /* Don't mess with the hard frame pointer.  */
> +  if (frame_pointer_needed)
> +    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
> +
> +  unsigned reg1 = cfun->machine->frame.wb_candidate1;
> +  unsigned reg2 = cfun->machine->frame.wb_candidate2;
> +  /* If aarch64_layout_frame has chosen registers to store/restore with
> +     writeback don't interfere with them to avoid having to output explicit
> +     stack adjustment instructions.
>      */
> +  if (reg2 != INVALID_REGNUM)
> +    bitmap_clear_bit (components, reg2);
> +  if (reg1 != INVALID_REGNUM)
> +    bitmap_clear_bit (components, reg1);
> +
> +  bitmap_clear_bit (components, LR_REGNUM);
> +  bitmap_clear_bit (components, SP_REGNUM);
> +
> +  return components;
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
> +
> +static sbitmap
> +aarch64_components_for_bb (basic_block bb)
> +{
> +  bitmap in = DF_LIVE_IN (bb);
> +  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
> +  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
> +
> +  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
> +  bitmap_clear (components);
> +
> +  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)

The use of R0_REGNUM and V31_REGNUM scares me a little, as we're
hardcoding where the end of the register file is (does this, for
example, fall apart with the SVE work that was recently posted?).
Something like a LAST_HARDREG_NUM might work?

> +    if ((!call_used_regs[regno])
> +	&& (bitmap_bit_p (in, regno)
> +	    || bitmap_bit_p (gen, regno)
> +	    || bitmap_bit_p (kill, regno)))
> +      bitmap_set_bit (components, regno);
> +
> +  return components;
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
> +   Nothing to do for aarch64.  */
> +
> +static void
> +aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
> +{
> +}

Is there no default "do nothing" hook for this?

> +
> +/* Return the next set bit in BMP from START onwards.  Return the total
> +   number of bits in BMP if no set bit is found at or after START.
>      */
> +
> +static unsigned int
> +aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
> +{
> +  unsigned int nbits = SBITMAP_SIZE (bmp);
> +  if (start == nbits)
> +    return start;
> +
> +  gcc_assert (start < nbits);
> +  for (unsigned int i = start; i < nbits; i++)
> +    if (bitmap_bit_p (bmp, i))
> +      return i;
> +
> +  return nbits;
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
> +
> +static void
> +aarch64_emit_prologue_components (sbitmap components)
> +{
> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
> +			     ? HARD_FRAME_POINTER_REGNUM
> +			     : STACK_POINTER_REGNUM);
> +
> +  unsigned total_bits = SBITMAP_SIZE (components);

Would this be clearer called last_regno?

> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
> +  rtx_insn *insn = NULL;
> +
> +  while (regno != total_bits)
> +    {
> +      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;

A comment about why this can be DFmode rather than some 128-bit mode
might be useful here.

> +      rtx reg = gen_rtx_REG (mode, regno);
> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
> +      if (!frame_pointer_needed)
> +	offset += cfun->machine->frame.frame_size
> +		  - cfun->machine->frame.hard_fp_offset;
> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
> +      rtx mem = gen_frame_mem (mode, addr);
> +
> +      rtx set = gen_rtx_SET (mem, reg);
> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
> +      /* No more registers to save after REGNO.
> +	 Emit a single save and exit.  */
> +      if (regno2 == total_bits)
> +	{
> +	  insn = emit_insn (set);
> +	  RTX_FRAME_RELATED_P (insn) = 1;
> +	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
> +	  break;
> +	}
> +
> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
> +      /* The next register is not of the same class or its offset is not
> +	 mergeable with the current one into a pair.
>      */
> +      if (!satisfies_constraint_Ump (mem)
> +	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
> +	  || (offset2 - cfun->machine->frame.reg_offset[regno])
> +	     != GET_MODE_SIZE (DImode))
> +	{
> +	  insn = emit_insn (set);
> +	  RTX_FRAME_RELATED_P (insn) = 1;
> +	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
> +
> +	  regno = regno2;
> +	  continue;
> +	}
> +
> +      /* REGNO2 can be stored in a pair with REGNO.  */
> +      rtx reg2 = gen_rtx_REG (mode, regno2);
> +      if (!frame_pointer_needed)
> +	offset2 += cfun->machine->frame.frame_size
> +		   - cfun->machine->frame.hard_fp_offset;
> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
> +      rtx mem2 = gen_frame_mem (mode, addr2);
> +      rtx set2 = gen_rtx_SET (mem2, reg2);
> +
> +      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
> +      RTX_FRAME_RELATED_P (insn) = 1;
> +      add_reg_note (insn, REG_CFA_OFFSET, set);
> +      add_reg_note (insn, REG_CFA_OFFSET, set2);
> +
> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
> +    }
> +
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
> +
> +static void
> +aarch64_emit_epilogue_components (sbitmap components)

Given the similarity of the logic, is there any way you can refactor
this with the prologue code above?

> +{
> +
> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
> +			     ? HARD_FRAME_POINTER_REGNUM
> +			     : STACK_POINTER_REGNUM);
> +  unsigned total_bits = SBITMAP_SIZE (components);
> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
> +  rtx_insn *insn = NULL;
> +
> +  while (regno != total_bits)
> +    {
> +      machine_mode mode = GP_REGNUM_P (regno) ?
> +	DImode : DFmode;
> +      rtx reg = gen_rtx_REG (mode, regno);
> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
> +      if (!frame_pointer_needed)
> +	offset += cfun->machine->frame.frame_size
> +		  - cfun->machine->frame.hard_fp_offset;
> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
> +      rtx mem = gen_frame_mem (mode, addr);
> +
> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
> +      /* No more registers after REGNO to restore.
> +	 Emit a single restore and exit.  */
> +      if (regno2 == total_bits)
> +	{
> +	  insn = emit_move_insn (reg, mem);
> +	  RTX_FRAME_RELATED_P (insn) = 1;
> +	  add_reg_note (insn, REG_CFA_RESTORE, reg);
> +	  break;
> +	}
> +
> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
> +      /* The next register is not of the same class or its offset is not
> +	 mergeable with the current one into a pair or the offset doesn't
> +	 fit for a load pair.  Emit a single restore and continue from
> +	 REGNO2.  */
> +      if (!satisfies_constraint_Ump (mem)
> +	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
> +	  || (offset2 - cfun->machine->frame.reg_offset[regno])
> +	     != GET_MODE_SIZE (DImode))
> +	{
> +	  insn = emit_move_insn (reg, mem);
> +	  RTX_FRAME_RELATED_P (insn) = 1;
> +	  add_reg_note (insn, REG_CFA_RESTORE, reg);
> +
> +	  regno = regno2;
> +	  continue;
> +	}
> +
> +      /* REGNO2 can be loaded in a pair with REGNO.  */
> +      rtx reg2 = gen_rtx_REG (mode, regno2);
> +      if (!frame_pointer_needed)
> +	offset2 += cfun->machine->frame.frame_size
> +		   - cfun->machine->frame.hard_fp_offset;
> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
> +      rtx mem2 = gen_frame_mem (mode, addr2);
> +
> +      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
> +      RTX_FRAME_RELATED_P (insn) = 1;
> +      add_reg_note (insn, REG_CFA_RESTORE, reg);
> +      add_reg_note (insn, REG_CFA_RESTORE, reg2);
> +
> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
> +    }
> +}
> +
> +/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.
>      */
> +
> +static void
> +aarch64_set_handled_components (sbitmap components)
> +{
> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
> +    if (bitmap_bit_p (components, regno))
> +      cfun->machine->reg_is_wrapped_separately[regno] = true;
> +}
> +
>  /* AArch64 stack frames generated by this compiler look like:
>
>  	+-------------------------------+
> @@ -3944,29 +4218,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
>    return false;
>  }
>
> -bool
> -aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
> -{
> -  return (offset >= -64 * GET_MODE_SIZE (mode)
> -	  && offset < 64 * GET_MODE_SIZE (mode)
> -	  && offset % GET_MODE_SIZE (mode) == 0);
> -}
> -
> -static inline bool
> -offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
> -			       HOST_WIDE_INT offset)
> -{
> -  return offset >= -256 && offset < 256;
> -}
> -
> -static inline bool
> -offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
> -{
> -  return (offset >= 0
> -	  && offset < 4096 * GET_MODE_SIZE (mode)
> -	  && offset % GET_MODE_SIZE (mode) == 0);
> -}
> -
>  /* Return true if MODE is one of the modes for which we
>     support LDP/STP operations.
>     */
>
> @@ -14452,6 +14703,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
>  #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
>    aarch64_first_cycle_multipass_dfa_lookahead_guard
>
> +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
> +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
> +  aarch64_get_separate_components
> +
> +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
> +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
> +  aarch64_components_for_bb
> +
> +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
> +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
> +  aarch64_disqualify_components
> +
> +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
> +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
> +  aarch64_emit_prologue_components
> +
> +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
> +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
> +  aarch64_emit_epilogue_components
> +
> +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
> +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
> +  aarch64_set_handled_components
> +
>  #undef TARGET_TRAMPOLINE_INIT
>  #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
>
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 584ff5c..fb89e5a 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
>  typedef struct GTY (()) machine_function
>  {
>    struct aarch64_frame frame;
> +  /* One entry for each GPR and FP register.  */
> +  bool reg_is_wrapped_separately[V31_REGNUM + 1];

Another hardcoded use of V31_REGNUM.

Thanks,
James