Jan,

Please see my answers below.

> -----Original Message-----
> From: Jan Hubicka [mailto:hubi...@ucw.cz]
> Sent: Sunday, October 20, 2013 12:30 AM
> To: Zamyatin, Igor; gcc-patches@gcc.gnu.org; vmaka...@redhat.com
> Cc: 'Jan Hubicka'
> Subject: Re: Honnor ix86_accumulate_outgoing_args again
> 
> > Jan,
> >
> > Does this seem reasonable to you?
> 
> Oops, sorry, I missed your email. (I was travelling and I am finishing a paper
> now).
> >
> > Thanks,
> > Igor
> >
> > > -----Original Message-----
> > > From: Zamyatin, Igor
> > > Sent: Tuesday, October 15, 2013 3:48 PM
> > > To: Jan Hubicka
> > > Subject: RE: Honnor ix86_accumulate_outgoing_args again
> > >
> > > Jan,
> > >
> > > Now we have the following prologue in, say, the phi0 routine in equake
> > >
> > > 0x804aa90 1  push   %ebp
> > > 0x804aa91 2  mov    %esp,%ebp
> > > 0x804aa93 3  sub    $0x18,%esp
> > > 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
> > > 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
> > > 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see a big stall somewhere
> > > here or 1-2 instructions above
> > >
> > > While earlier it was
> > >
> > > 0x804abd0 1 sub    $0x2c,%esp
> > > 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
> > > 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
> > > 0x804abe1 4 vcomisd %xmm1,%xmm0
> 
> Thanks for the analysis! It is a different benchmark than for Bulldozer, but
> apparently the same case.  Again, we used to eliminate the frame pointer
> here but IRA now doesn't.  Do you see the same regression with
> -fno-omit-frame-pointer -maccumulate-outgoing-args?

No, with these flags the performance is back.

> 
> I suppose this is a conflict between the push instruction, which is handled
> by the stack engine, and the initialization of EBP, which isn't.  That would
> explain why Bulldozer doesn't seem to care about this particular benchmark
> (its stack engine seems to have a quite different design).
> 
> This is a bit of a sad situation - accumulate-outgoing-args is expensive
> code-size-wise, and it seems we don't really need ebp with
> -mno-accumulate-outgoing-args.

Could you please explain this a bit more?

Thanks,
Igor

> The non-accumulation code path was mistakenly disabled for too long ;(
> 
> Vladimir, how much effort do you think it would be to fix the frame pointer
> elimination here?  I can imagine it is a quite tricky case.  If so, I would
> suggest adding m_CORE_ALL to X86_TUNE_ACCUMULATE_OUTGOING_ARGS with a
> comment explaining the problem and mentioning the regression on equake on
> Core and mgrid on Bulldozer, and opening an enhancement request for this...
> 
> I also wonder, if direct ESP use and push/pop instructions are causing such
> noticeable issues, whether we can't "shrink wrap" this into the red zone in
> the 64bit compilation.  It seems that even with -maccumulate-outgoing-args,
> pushing the frame allocation as late as possible in the function would be a
> good idea, so it is not close to the push/pop/call/ret.
> 
> Honza
> 
> > >
> > > Note that for Atom and SLM no regression happens because they are
> > > already in x86_accumulate_outgoing_args.  As for other machines, it
> > > seems that now (after your change) they don't get
> > > MASK_ACCUMULATE_OUTGOING_ARGS, and that leads to using ebp in the
> > > prologue.
> > >
> > > Thanks,
> > > Igor
> > >
> > > -----Original Message-----
> > > From: Jan Hubicka [mailto:hubi...@ucw.cz]
> > > Sent: Monday, October 14, 2013 8:44 PM
> > > To: Zamyatin, Igor
> > > Cc: Jan Hubicka
> > > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > >
> > > > Jan,
> > > >
> > > > I see that you haven't committed this change.  Any particular reason
> > > > for this?
> > >
> > > No, seems I forgot about it.
> > > >
> > > > BTW, after r203171
> > > > (http://gcc.gnu.org/ml/gcc-cvs/2013-10/msg00122.html) we see a lot of
> > > > performance degradation on spec2000 and spec2006 tests on SandyBridge
> > > > and IvyBridge in 32 bits mode.  E.g. 183.equake got ~-15%.  Have you
> > > > seen it?
> > >
> > > I tested this only on Bulldozer chips, where I saw a large regression
> > > on mgrid but, curiously, not on equake.  I tracked it down to the frame
> > > pointer being forced by IRA, which Vladimir promised to look at later.
> > >
> > > I assumed that arg accumulation was tested before the flag was set,
> > > so I did not benchmark chips that previously disabled it.  Perhaps
> > > things changed because IRA was merged in the meantime while this
> > > feature was accidentally disabled.
> > >
> > > It would be great if you can figure out why equake regresses - in FP
> > > benchmarks we do not use push/pop, so perhaps we just end up emitting
> > > something really stupid, or we manage to confuse the stack engine...
> > >
> > > Honza
> > >
> > > >
> > > > Thanks,
> > > > Igor
> > > >
> > > > -----Original Message-----
> > > > From: gcc-patches-ow...@gcc.gnu.org
> > > > [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jan Hubicka
> > > > Sent: Thursday, October 10, 2013 10:40 PM
> > > > To: Jan Hubicka
> > > > Cc: Vladimir Makarov; gcc-patches@gcc.gnu.org;
> > > > hjl.to...@gmail.com; ubiz...@gmail.com; r...@redhat.com;
> > > ganesh.gopalasubraman...@amd.com
> > > > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > > >
> > > > Hi,
> > > > this patch makes ACCUMULATE_OUTGOING_ARGS disable itself when the
> > > > function is cold.  I did some extra testing and, to my amusement, we
> > > > now seem to output more compact unwind info when
> > > > ACCUMULATE_OUTGOING_ARGS is disabled, so this seems to be quite a
> > > > consistent code size win.
> > > >
> > > > We actually can do better and enable ACCUMULATE_OUTGOING_ARGS only
> > > > when the function contains hot calls.  This should also avoid the
> > > > need for frame allocation in the prologue/epilogue on the hot path.
> > > > I will look into this incrementally.
> > > >
> > > > I also noticed that we still have some tuning flags in i386.c rather
> > > > than in x86-tune.def, so I moved them there.
> > > >
> > > > Testing x86_64-linux and will commit it once testing converges.
> > > >
> > > > Honza
> > > >         * config/i386/i386.h (ACCUMULATE_OUTGOING_ARGS): Disable
> > > >         accumulation for cold functions.
> > > >         * x86-tune.def (X86_TUNE_USE_LEAVE): Update comment.
> > > >         (X86_TUNE_PUSH_MEMORY): Likewise.
> > > >         (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL,
> > > >         X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL): New.
> > > >         (X86_TUNE_ACCUMULATE_OUTGOING_ARGS,
> > > >         X86_TUNE_ALWAYS_FANCY_MATH_387): New.
> > > >         * i386.c (x86_accumulate_outgoing_args,
> > > >         x86_arch_always_fancy_math_387,
> > > >         x86_avx256_split_unaligned_load,
> > > >         x86_avx256_split_unaligned_store): Remove.
> > > >         (ix86_option_override_internal): Update to use tune features
> > > >         instead of variables.
> > > > Index: config/i386/i386.h
> > > > Index: config/i386/i386.h
> > > > ===================================================================
> > > > --- config/i386/i386.h  (revision 203380)
> > > > +++ config/i386/i386.h  (working copy)
> > > > @@ -1492,13 +1492,26 @@ enum reg_class
> > > >     will be computed and placed into the variable `crtl->outgoing_args_size'.
> > > >     No space will be pushed onto the stack for each call; instead, the
> > > >     function prologue should increase the stack frame size by this amount.
> > > > +
> > > > +   In 32bit mode enabling argument accumulation results in about 5%
> > > > +   code size growth because move instructions are less compact than
> > > > +   push.  In 64bit mode the difference is less drastic but visible.
> > > > +
> > > > +   FIXME: Unlike earlier implementations, the size of unwind info
> > > > +   seems to actually grow with accumulation.  Is that because
> > > > +   accumulated args unwind info became unnecessarily bloated?
> > > >
> > > >     64-bit MS ABI seem to require 16 byte alignment everywhere except for
> > > > -   function prologue and apilogue.  This is not possible without
> > > > -   ACCUMULATE_OUTGOING_ARGS.  */
> > > > +   function prologue and epilogue.  This is not possible without
> > > > +   ACCUMULATE_OUTGOING_ARGS.
> > > > +
> > > > +   If stack probes are required, the space used for large function
> > > > +   arguments on the stack must also be probed, so enable
> > > > +   -maccumulate-outgoing-args so this happens in the prologue.  */
> > > >
> > > >  #define ACCUMULATE_OUTGOING_ARGS \
> > > > -  (TARGET_ACCUMULATE_OUTGOING_ARGS || TARGET_64BIT_MS_ABI)
> > > > +  ((TARGET_ACCUMULATE_OUTGOING_ARGS && optimize_function_for_speed_p (cfun)) \
> > > > +   || TARGET_STACK_PROBE || TARGET_64BIT_MS_ABI)
> > > >
> > > >  /* If defined, a C expression whose value is nonzero when we want to use PUSH
> > > >     instructions to pass outgoing arguments.  */
> > > > Index: config/i386/x86-tune.def
> > > > ===================================================================
> > > > --- config/i386/x86-tune.def    (revision 203387)
> > > > +++ config/i386/x86-tune.def    (working copy)
> > > > @@ -18,15 +18,13 @@ a copy of the GCC Runtime Library Except
> > > >  see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> > > >  <http://www.gnu.org/licenses/>.  */
> > > >
> > > > -/* X86_TUNE_USE_LEAVE: Leave does not affect Nocona SPEC2000 results
> > > > -   negatively, so enabling for Generic64 seems like good code size
> > > > -   tradeoff.  We can't enable it for 32bit generic because it does not
> > > > -   work well with PPro base chips.  */
> > > > +/* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it
> > > > +   fits.  */
> > > >  DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
> > > >           m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC)
> > > >
> > > >  /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem" instructions.
> > > > -   Some chips, like 486 and Pentium have problems with these sequences..  */
> > > > +   Some chips, like 486 and Pentium, work faster with separate load
> > > > +   and push instructions.  */
> > > >  DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory",
> > > >            m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE
> > > >            | m_GENERIC)
> > > > @@ -210,6 +208,16 @@ DEF_TUNE (X86_TUNE_SSE_UNALIGNED_LOAD_OP
> > > >  DEF_TUNE (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL, "sse_unaligned_store_optimal",
> > > >            m_COREI7 | m_BDVER | m_SLM | m_GENERIC)
> > > >
> > > > +/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned loads
> > > > +   are split.  */
> > > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
> > > > +          ~(m_COREI7 | m_GENERIC))
> > > > +
> > > > +/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned stores
> > > > +   are split.  */
> > > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
> > > > +          ~(m_COREI7 | m_BDVER | m_GENERIC))
> > > > +
> > > >  /* Use packed single precision instructions where posisble.  I.e. movups instead
> > > >     of movupd.  */
> > > >  DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optimal",
> > > > @@ -398,3 +406,24 @@ DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CM
> > > >     fp converts to destination register.  */
> > > >  DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
> > > >            m_SLM)
> > > > +
> > > > +/* X86_TUNE_ACCUMULATE_OUTGOING_ARGS: Allocate stack space for outgoing
> > > > +   arguments in prologue/epilogue instead of separately for each call
> > > > +   by push/pop instructions.
> > > > +   This increases code size by about 5% in 32bit mode, less so in 64bit
> > > > +   mode because parameters are passed in registers.  It is a considerable
> > > > +   win for targets without a stack engine that prevents multiple push
> > > > +   operations from happening in parallel.
> > > > +
> > > > +   FIXME: the flag is incorrectly enabled for amdfam10, Bulldozer,
> > > > +   Bobcat and Generic.  This is because disabling it causes a large
> > > > +   regression on mgrid due to an IRA limitation leading to unnecessary
> > > > +   use of the frame pointer in 32bit mode.  */
> > > > +DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
> > > > +          m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC)
> > > > +
> > > > +/* X86_TUNE_ALWAYS_FANCY_MATH_387: controls use of fancy 387
> > > > +   operations, such as fsqrt, fprem, fsin, fcos, fsincos etc.
> > > > +   Should be enabled for all targets that always have a coprocessor.  */
> > > > +DEF_TUNE (X86_TUNE_ALWAYS_FANCY_MATH_387, "always_fancy_math_387",
> > > > +          ~(m_386 | m_486))
> > > > Index: config/i386/i386.c
> > > > ===================================================================
> > > > --- config/i386/i386.c  (revision 203380)
> > > > +++ config/i386/i386.c  (working copy)
> > > > @@ -1898,18 +1898,6 @@ static unsigned int initial_ix86_arch_fe
> > > >    ~m_386,
> > > >  };
> > > >
> > > > -static const unsigned int x86_accumulate_outgoing_args
> > > > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
> > > > -
> > > > -static const unsigned int x86_arch_always_fancy_math_387
> > > > -  = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
> > > > -
> > > > -static const unsigned int x86_avx256_split_unaligned_load
> > > > -  = m_COREI7 | m_GENERIC;
> > > > -
> > > > -static const unsigned int x86_avx256_split_unaligned_store
> > > > -  = m_COREI7 | m_BDVER | m_GENERIC;
> > > > -
> > > >  /* In case the average insn count for single function invocation is
> > > >     lower than this constant, emit fast (but longer) prologue and
> > > >     epilogue code.  */
> > > > @@ -2920,7 +2908,7 @@ static void
> > > >  ix86_option_override_internal (bool main_args_p)
> > > >  {
> > > >    int i;
> > > > -  unsigned int ix86_arch_mask, ix86_tune_mask;
> > > > +  unsigned int ix86_arch_mask;
> > > >    const bool ix86_tune_specified = (ix86_tune_string != NULL);
> > > >    const char *prefix;
> > > >    const char *suffix;
> > > > @@ -3673,7 +3661,7 @@ ix86_option_override_internal (bool main
> > > >
> > > >    /* If the architecture always has an FPU, turn off NO_FANCY_MATH_387,
> > > >       since the insns won't need emulation.  */
> > > > -  if (x86_arch_always_fancy_math_387 & ix86_arch_mask)
> > > > +  if (ix86_tune_features [X86_TUNE_ALWAYS_FANCY_MATH_387])
> > > >      target_flags &= ~MASK_NO_FANCY_MATH_387;
> > > >
> > > >    /* Likewise, if the target doesn't have a 387, or we've specified
> > > > @@ -3805,8 +3793,7 @@ ix86_option_override_internal (bool main
> > > >         gcc_unreachable ();
> > > >        }
> > > >
> > > > -  ix86_tune_mask = 1u << ix86_tune;
> > > > -  if ((x86_accumulate_outgoing_args & ix86_tune_mask)
> > > > +  if (ix86_tune_features [X86_TUNE_ACCUMULATE_OUTGOING_ARGS]
> > > >        && !(target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
> > > >        && !optimize_size)
> > > >      target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS;
> > > > @@ -3946,10 +3933,10 @@ ix86_option_override_internal (bool main
> > > >        if (flag_expensive_optimizations
> > > >           && !(target_flags_explicit & MASK_VZEROUPPER))
> > > >         target_flags |= MASK_VZEROUPPER;
> > > > -      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
> > > > +      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL]
> > > >           && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
> > > >         target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
> > > > -      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
> > > > +      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL]
> > > >           && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
> > > >         target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
> > > >        /* Enable 128-bit AVX instruction generation
