Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-17 Thread Igor Zamyatin
On Thu, Apr 12, 2012 at 3:16 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Thu, Apr 12, 2012 at 1:05 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi All!

 Here is a patch that enables unroll at O2 for Atom.

 This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
 bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
 bits).

 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
 why? Why do you disable register renaming?  check_imull requires a function
 comment.

 Sure, enabling unroll for O3 could be the next step.
 We can't avoid code size increase with unroll - what number do you
 think will be appropriate?
 Register renaming was the reason of several degradations during tuning 
 process
 Comment for check_imull was added


 This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

 Why? EEMBC was measured and result provided here just because this
 benchmark considers to be very relevant for Atom

 I'd say that SPEC INT (2000 / 2006) is more relevant for Atom (SPEC FP
 would be irrelevant OTOH).  Similar code size for, say, Mozilla Firefox
 or GCC itself would be important.

 -O2 is not supposed to give best benchmark results.

 O2 is wide-used so performance improvement could be important for users.

 But not at a 5% size cost.  Please also always check the compile-time effect
 which is important for -O2 as well.

What would be an acceptable number of size cost/compile-time increase
for O2 and O3 on EEMBC, SPEC INT 2000 and Mozilla?

Is it possible in common to put Atom-specific unroll heuristics under
some option which could be mentioned in GCC docs?


 Richard.


 Thanks,
 Richard.


 Tested for i386 and x86-64, ok for trunk?

 Updated patch attached


 Thanks,
 Igor

 ChangeLog:

 2012-04-10  Yakovlev Vladimir  vladimir.b.yakov...@intel.com

        * gcc/config/i386/i386.c (check_imul): New routine.
        (ix86_loop_unroll_adjust): New target hook.
        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
        (TARGET_LOOP_UNROLL_ADJUST): New define.

Thanks,
Igor


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 5:34 PM, Andi Kleen a...@firstfloor.org wrote:
 Richard Guenther richard.guent...@gmail.com writes:

 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
 why? Why do you disable register renaming?  check_imull requires a function
 comment.

 This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

 -O2 is not supposed to give best benchmark results.

 Besides it is against the Intel Optimization Manual recommendation
 to prefer small code on Atom to avoid falling out of the predecode hints
 in the cache.

Yes, this is well-known concern for Atom. But in the same time unroll
could help a lot for inorder machines because it could provide more
opportunities to a compiler scheduler. And experiments showed that
unroll could really help.


 So would need much more benchmarking on macro workloads first at least.

Like what, for example? I believe in this case everything also
strongly depends on test usage model (e.g. it usually compiled with Os
not O2) and, let's say, internal test structure - whether there are
hot loops that suitable for unroll.


 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi All!

 Here is a patch that enables unroll at O2 for Atom.

 This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
 bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
 bits).

 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
 why? Why do you disable register renaming?  check_imull requires a function
 comment.

Sure, enabling unroll for O3 could be the next step.
We can't avoid code size increase with unroll - what number do you
think will be appropriate?
Register renaming was the reason of several degradations during tuning process
Comment for check_imull was added


 This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

Why? EEMBC was measured and result provided here just because this
benchmark considers to be very relevant for Atom


 -O2 is not supposed to give best benchmark results.

O2 is wide-used so performance improvement could be important for users.


 Thanks,
 Richard.


 Tested for i386 and x86-64, ok for trunk?

Updated patch attached


 Thanks,
 Igor

 ChangeLog:

 2012-04-10  Yakovlev Vladimir  vladimir.b.yakov...@intel.com

        * gcc/config/i386/i386.c (check_imul): New routine.
        (ix86_loop_unroll_adjust): New target hook.
        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
        (TARGET_LOOP_UNROLL_ADJUST): New define.


unroll1.patch
Description: Binary data


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Richard Guenther
On Thu, Apr 12, 2012 at 1:05 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi All!

 Here is a patch that enables unroll at O2 for Atom.

 This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
 bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
 bits).

 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
 why? Why do you disable register renaming?  check_imull requires a function
 comment.

 Sure, enabling unroll for O3 could be the next step.
 We can't avoid code size increase with unroll - what number do you
 think will be appropriate?
 Register renaming was the reason of several degradations during tuning process
 Comment for check_imull was added


 This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

 Why? EEMBC was measured and result provided here just because this
 benchmark considers to be very relevant for Atom

I'd say that SPEC INT (2000 / 2006) is more relevant for Atom (SPEC FP
would be irrelevant OTOH).  Similar code size for, say, Mozilla Firefox
or GCC itself would be important.

 -O2 is not supposed to give best benchmark results.

 O2 is wide-used so performance improvement could be important for users.

But not at a 5% size cost.  Please also always check the compile-time effect
which is important for -O2 as well.

Richard.


 Thanks,
 Richard.


 Tested for i386 and x86-64, ok for trunk?

 Updated patch attached


 Thanks,
 Igor

 ChangeLog:

 2012-04-10  Yakovlev Vladimir  vladimir.b.yakov...@intel.com

        * gcc/config/i386/i386.c (check_imul): New routine.
        (ix86_loop_unroll_adjust): New target hook.
        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
        (TARGET_LOOP_UNROLL_ADJUST): New define.


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Andi Kleen
  So would need much more benchmarking on macro workloads first at least.
 
 Like what, for example? I believe in this case everything also
 strongly depends on test usage model (e.g. it usually compiled with Os
 not O2) and, let's say, internal test structure - whether there are
 hot loops that suitable for unroll.

Normally the compiler doesn't know if a loop is hot unless you use
profile feedback. So worst case on a big code base you may end up
with a lot of unnecessary unrolling. On cold code it's just wasted
bytes, but there could be already icache limited code where it 
would be worse.

How about just a compiler bootstrap on Atom as a worst case?
For the benchmark can you use profile feedback?

BTW I know some loops are unrolled at -O3 by default at tree level because
the vectorizer likes it.  I actually have an older patch to dial this
down for some common cases.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-11 Thread Richard Guenther
On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi All!

 Here is a patch that enables unroll at O2 for Atom.

 This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
 bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
 bits).

5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
why? Why do you disable register renaming?  check_imull requires a function
comment.

This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

-O2 is not supposed to give best benchmark results.

Thanks,
Richard.


 Tested for i386 and x86-64, ok for trunk?

 Thanks,
 Igor

 ChangeLog:

 2012-04-10  Yakovlev Vladimir  vladimir.b.yakov...@intel.com

        * gcc/config/i386/i386.c (check_imull): New routine.
        (ix86_loop_unroll_adjust): New target hook.
        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
        (TARGET_LOOP_UNROLL_ADJUST): New define.


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-11 Thread Andi Kleen
Richard Guenther richard.guent...@gmail.com writes:

 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
 why? Why do you disable register renaming?  check_imull requires a function
 comment.

 This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

 -O2 is not supposed to give best benchmark results.

Besides it is against the Intel Optimization Manual recommendation
to prefer small code on Atom to avoid falling out of the predecode hints
in the cache.

So would need much more benchmarking on macro workloads first at least.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only