> Jan,
> 
> Does this seem reasonable to you?

Oops, sorry, I missed your email. (I was travelling and I am finishing a paper
now).
> 
> Thanks,
> Igor
> 
> > -----Original Message-----
> > From: Zamyatin, Igor
> > Sent: Tuesday, October 15, 2013 3:48 PM
> > To: Jan Hubicka
> > Subject: RE: Honnor ix86_accumulate_outgoing_args again
> > 
> > Jan,
> > 
> > Now we have the following prologue in, say, the phi0 routine in equake:
> > 
> > 0x804aa90 1  push   %ebp
> > 0x804aa91 2  mov    %esp,%ebp
> > 0x804aa93 3  sub    $0x18,%esp
> > 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
> > 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
> > 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see a big stall somewhere here,
> >                                        or 1-2 instructions above
> > 
> > While earlier it was
> > 
> > 0x804abd0 1 sub    $0x2c,%esp
> > 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
> > 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
> > 0x804abe1 4 vcomisd %xmm1,%xmm0

Thanks for the analysis! It is a different benchmark than for Bulldozer, but
apparently the same case.  Again, we used to eliminate the frame pointer here,
but IRA now doesn't.  Do you see the same regression with
-fno-omit-frame-pointer -maccumulate-outgoing-args?

I suppose this is a conflict between the push instruction, which is handled by
the stack engine, and the initialization of EBP, which isn't.  That would
explain why Bulldozer doesn't seem to care about this particular benchmark
(its stack engine seems to have a quite different design).
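
To illustrate where I would expect the penalty to land (this is my reading of
the optimization manuals, not something I measured: the stack engine tracks
ESP updates from push/pop as a delta and has to insert a sync uop whenever an
instruction reads the architectural ESP):

   push   %ebp        <-- ESP update folded into the stack engine's delta
   mov    %esp,%ebp   <-- explicit ESP read; forces a sync uop here
   sub    $0x18,%esp  <-- explicit ESP update
   ...

The old prologue only ever touched ESP explicitly, so it had no such sync
point.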

This is a bit of a sad situation - accumulate-outgoing-args is expensive
code-size-wise, and it seems we don't really need the frame pointer (ebp) with
-mno-accumulate-outgoing-args.  The non-accumulation code path was mistakenly
disabled for too long ;(
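
To make the size cost concrete, a sketch of the two shapes for one call taking
two stack arguments (foo is a placeholder; the byte counts are from memory, so
treat them as approximate):

With -mno-accumulate-outgoing-args:

   push   $2             <-- 2 bytes
   push   $1             <-- 2 bytes
   call   foo

With -maccumulate-outgoing-args:

   sub    $0x10,%esp     <-- once, in the prologue
   ...
   movl   $2,0x4(%esp)   <-- 8 bytes
   movl   $1,(%esp)      <-- 7 bytes
   call   foo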

Vladimir, how much effort do you think it would take to fix the frame pointer
elimination here?
I can imagine it is quite a tricky case.  If so, I would suggest adding
m_CORE_ALL to X86_TUNE_ACCUMULATE_OUTGOING_ARGS with a comment explaining the
problem, mentioning the regressions on equake on Core and on mgrid on
Bulldozer, and opening an enhancement request for this...
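
Concretely, the stopgap I have in mind would be something like this in
x86-tune.def (just a sketch; the exact mask and comment wording are open):

   /* FIXME: m_CORE_ALL is included only to work around the IRA frame
      pointer regressions (equake on Core, mgrid on Bulldozer).  */
   DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
             m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_CORE_ALL
             | m_AMD_MULTIPLE | m_GENERIC)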

If direct ESP use and push/pop instructions are causing such noticeable
issues, I also wonder whether we couldn't "shrink wrap" this into the red zone
in 64-bit compilation.  It seems that even with -maccumulate-outgoing-args,
pushing the frame allocation as late as possible in the function would be a
good idea, so that it is not close to the push/pop/call/ret.
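
For reference, by red zone I mean the 128 bytes below %rsp that the 64-bit
SysV ABI guarantees are never clobbered asynchronously.  A hedged sketch (bar
is a placeholder, and this only works while nothing pushes onto the stack, so
it is the idea rather than a ready scheme):

   bar:                        <-- 64bit leaf function
      movl   %edi,-0x4(%rsp)   <-- scratch slot in the red zone; no
                                   sub/add of %rsp in the prologue at all
      ...
      ret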

Honza

> > 
> > Note that for Atom and SLM no regression happens because they are already
> > in x86_accumulate_outgoing_args.  As for other machines - it seems that now
> > (after your change) they don't get MASK_ACCUMULATE_OUTGOING_ARGS, which
> > leads to using ebp in the prologue.
> > 
> > Thanks,
> > Igor
> > 
> > -----Original Message-----
> > From: Jan Hubicka [mailto:hubi...@ucw.cz]
> > Sent: Monday, October 14, 2013 8:44 PM
> > To: Zamyatin, Igor
> > Cc: Jan Hubicka
> > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > 
> > > Jan,
> > >
> > > I see that you haven't committed this change. Any particular reason for
> > > this?
> > 
> > No, seems I forgot about it.
> > >
> > > BTW, after r203171 (http://gcc.gnu.org/ml/gcc-cvs/2013-10/msg00122.html)
> > > we see a lot of performance degradation on SPEC2000 and SPEC2006 tests
> > > on SandyBridge and IvyBridge in 32-bit mode.  E.g. 183.equake got ~-15%.
> > > Have you seen it?
> > 
> > I tested this only on Bulldozer chips, where I saw a large regression on
> > mgrid, but curiously not on equake.  I tracked it down to the frame pointer
> > being forced by IRA, which Vladimir promised to look at later.
> > 
> > I assumed that arg accumulation was tested before the flag was set, so I
> > did not benchmark chips that previously disabled it.  Perhaps things
> > changed because IRA was merged in the meantime while this feature was
> > accidentally disabled.
> > 
> > It would be great if you could figure out why equake regresses - in FP
> > benchmarks we do not use push/pop, so perhaps we just end up emitting
> > something really stupid, or we manage to confuse the stack engine...
> > 
> > Honza
> > 
> > >
> > > Thanks,
> > > Igor
> > >
> > > -----Original Message-----
> > > From: gcc-patches-ow...@gcc.gnu.org
> > > [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jan Hubicka
> > > Sent: Thursday, October 10, 2013 10:40 PM
> > > To: Jan Hubicka
> > > Cc: Vladimir Makarov; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com;
> > > ubiz...@gmail.com; r...@redhat.com; ganesh.gopalasubraman...@amd.com
> > > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > >
> > > Hi,
> > > this patch makes ACCUMULATE_OUTGOING_ARGS disable itself when the
> > > function is cold.  I did some extra testing and, to my amusement, we now
> > > seem to output more compact unwind info when ACCUMULATE_OUTGOING_ARGS is
> > > disabled, so this seems to be a quite consistent code size win.
> > >
> > > We can actually do better and enable ACCUMULATE_OUTGOING_ARGS only when
> > > the function contains hot calls.  This should also avoid the need for
> > > frame allocation in the prologue/epilogue on the hot path.  I will look
> > > into this incrementally...
> > >
> > > I also noticed that we still have some tuning flags in i386.c rather
> > > than in x86-tune.def, so I moved them there.
> > >
> > > Testing x86_64-linux and will commit it once testing converges.
> > >
> > > Honza
> > >   * config/i386/i386.h (ACCUMULATE_OUTGOING_ARGS): Disable
> > >   accumulation for cold functions.
> > >   * x86-tune.def (X86_TUNE_USE_LEAVE): Update comment.
> > >   (X86_TUNE_PUSH_MEMORY): Likewise.
> > >   (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL,
> > >   X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL): New.
> > >   (X86_TUNE_ACCUMULATE_OUTGOING_ARGS,
> > >   X86_TUNE_ALWAYS_FANCY_MATH_387): New.
> > >   * i386.c (x86_accumulate_outgoing_args,
> > >   x86_arch_always_fancy_math_387, x86_avx256_split_unaligned_load,
> > >   x86_avx256_split_unaligned_store): Remove.
> > >   (ix86_option_override_internal): Update to use tune features
> > >   instead of variables.
> > > Index: config/i386/i386.h
> > > ===================================================================
> > > --- config/i386/i386.h      (revision 203380)
> > > +++ config/i386/i386.h      (working copy)
> > > @@ -1492,13 +1492,26 @@ enum reg_class
> > >     will be computed and placed into the variable `crtl->outgoing_args_size'.
> > >     No space will be pushed onto the stack for each call; instead, the
> > >     function prologue should increase the stack frame size by this amount.
> > > +
> > > +   In 32bit mode enabling argument accumulation results in about 5% code size
> > > +   growth because move instructions are less compact than push.  In 64bit
> > > +   mode the difference is less drastic but visible.
> > > +
> > > +   FIXME: Unlike earlier implementations, the size of unwind info seems to
> > > +   actually grow with accumulation.  Is that because accumulated args
> > > +   unwind info became unnecessarily bloated?
> > >
> > >     64-bit MS ABI seem to require 16 byte alignment everywhere except for
> > > -   function prologue and apilogue.  This is not possible without
> > > -   ACCUMULATE_OUTGOING_ARGS.  */
> > > +   function prologue and epilogue.  This is not possible without
> > > +   ACCUMULATE_OUTGOING_ARGS.
> > > +
> > > +   If stack probes are required, the space used for large function
> > > +   arguments on the stack must also be probed, so enable
> > > +   -maccumulate-outgoing-args so this happens in the prologue.  */
> > >
> > >  #define ACCUMULATE_OUTGOING_ARGS \
> > > -  (TARGET_ACCUMULATE_OUTGOING_ARGS || TARGET_64BIT_MS_ABI)
> > > +  ((TARGET_ACCUMULATE_OUTGOING_ARGS && optimize_function_for_speed_p (cfun)) \
> > > +   || TARGET_STACK_PROBE || TARGET_64BIT_MS_ABI)
> > >
> > >  /* If defined, a C expression whose value is nonzero when we want to use PUSH
> > >     instructions to pass outgoing arguments.  */
> > > Index: config/i386/x86-tune.def
> > > ===================================================================
> > > --- config/i386/x86-tune.def      (revision 203387)
> > > +++ config/i386/x86-tune.def      (working copy)
> > > @@ -18,15 +18,13 @@ a copy of the GCC Runtime Library Except
> > >  see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> > >  <http://www.gnu.org/licenses/>.  */
> > >
> > > -/* X86_TUNE_USE_LEAVE: Leave does not affect Nocona SPEC2000 results
> > > -   negatively, so enabling for Generic64 seems like good code size
> > > -   tradeoff.  We can't enable it for 32bit generic because it does not
> > > -   work well with PPro base chips.  */
> > > +/* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it
> > > +   fits.  */
> > >  DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
> > >            m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC)
> > >
> > >  /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem" instructions.
> > > -   Some chips, like 486 and Pentium have problems with these sequences.. */
> > > +   Some chips, like 486 and Pentium, work faster with separate load
> > > +   and push instructions.  */
> > >  DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory",
> > >            m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE
> > >            | m_GENERIC)
> > > @@ -210,6 +208,16 @@ DEF_TUNE (X86_TUNE_SSE_UNALIGNED_LOAD_OP
> > >  DEF_TUNE (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL, "sse_unaligned_store_optimal",
> > >            m_COREI7 | m_BDVER | m_SLM | m_GENERIC)
> > >
> > > +/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned loads are
> > > +   split.  */
> > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
> > > +          ~(m_COREI7 | m_GENERIC))
> > > +
> > > +/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned stores are
> > > +   split.  */
> > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
> > > +          ~(m_COREI7 | m_BDVER | m_GENERIC))
> > > +
> > >  /* Use packed single precision instructions where posisble.  I.e. movups instead
> > >     of movupd.  */
> > >  DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optimal",
> > > @@ -398,3 +406,24 @@ DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CM
> > >     fp converts to destination register.  */
> > >  DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
> > >            m_SLM)
> > > +
> > > +/* X86_TUNE_ACCUMULATE_OUTGOING_ARGS: Allocate stack space for outgoing
> > > +   arguments in prologue/epilogue instead of separately for each call
> > > +   by push/pop instructions.
> > > +   This increases code size by about 5% in 32bit mode, less so in 64bit mode
> > > +   because parameters are passed in registers.  It is a considerable win
> > > +   for targets without a stack engine, where dependencies on the stack
> > > +   pointer prevent multiple push operations from happening in parallel.
> > > +
> > > +   FIXME: the flag is incorrectly enabled for amdfam10, Bulldozer,
> > > +   Bobcat and Generic.  This is because disabling it causes a large
> > > +   regression on mgrid due to an IRA limitation leading to unnecessary
> > > +   use of the frame pointer in 32bit mode.  */
> > > +DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
> > > +          m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC)
> > > +
> > > +/* X86_TUNE_ALWAYS_FANCY_MATH_387: controls use of fancy 387 operations,
> > > +   such as fsqrt, fprem, fsin, fcos, fsincos etc.
> > > +   Should be enabled for all targets that always have a coprocessor.  */
> > > +DEF_TUNE (X86_TUNE_ALWAYS_FANCY_MATH_387, "always_fancy_math_387",
> > > +          ~(m_386 | m_486))
> > > Index: config/i386/i386.c
> > > ===================================================================
> > > --- config/i386/i386.c      (revision 203380)
> > > +++ config/i386/i386.c      (working copy)
> > > @@ -1898,18 +1898,6 @@ static unsigned int initial_ix86_arch_fe
> > >    ~m_386,
> > >  };
> > >
> > > -static const unsigned int x86_accumulate_outgoing_args
> > > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
> > > -
> > > -static const unsigned int x86_arch_always_fancy_math_387
> > > -  = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
> > > -
> > > -static const unsigned int x86_avx256_split_unaligned_load
> > > -  = m_COREI7 | m_GENERIC;
> > > -
> > > -static const unsigned int x86_avx256_split_unaligned_store
> > > -  = m_COREI7 | m_BDVER | m_GENERIC;
> > > -
> > >  /* In case the average insn count for single function invocation is
> > >     lower than this constant, emit fast (but longer) prologue and
> > >     epilogue code.  */
> > > @@ -2920,7 +2908,7 @@ static void
> > >  ix86_option_override_internal (bool main_args_p)
> > >  {
> > >    int i;
> > > -  unsigned int ix86_arch_mask, ix86_tune_mask;
> > > +  unsigned int ix86_arch_mask;
> > >    const bool ix86_tune_specified = (ix86_tune_string != NULL);
> > >    const char *prefix;
> > >    const char *suffix;
> > > @@ -3673,7 +3661,7 @@ ix86_option_override_internal (bool main
> > >
> > >    /* If the architecture always has an FPU, turn off NO_FANCY_MATH_387,
> > >       since the insns won't need emulation.  */
> > > -  if (x86_arch_always_fancy_math_387 & ix86_arch_mask)
> > > +  if (ix86_tune_features [X86_TUNE_ALWAYS_FANCY_MATH_387])
> > >      target_flags &= ~MASK_NO_FANCY_MATH_387;
> > >
> > >    /* Likewise, if the target doesn't have a 387, or we've specified
> > > @@ -3805,8 +3793,7 @@ ix86_option_override_internal (bool main
> > >          gcc_unreachable ();
> > >        }
> > >
> > > -  ix86_tune_mask = 1u << ix86_tune;
> > > -  if ((x86_accumulate_outgoing_args & ix86_tune_mask)
> > > +  if (ix86_tune_features [X86_TUNE_ACCUMULATE_OUTGOING_ARGS]
> > >        && !(target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
> > >        && !optimize_size)
> > >      target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS;
> > > @@ -3946,10 +3933,10 @@ ix86_option_override_internal (bool main
> > >        if (flag_expensive_optimizations
> > >            && !(target_flags_explicit & MASK_VZEROUPPER))
> > >          target_flags |= MASK_VZEROUPPER;
> > > -      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
> > > +      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL]
> > >            && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
> > >          target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
> > > -      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
> > > +      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL]
> > >            && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
> > >          target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
> > >        /* Enable 128-bit AVX instruction generation
