Re: AVX generic mode tuning discussion.

2013-01-08 Thread Richard Biener
On Mon, Jan 7, 2013 at 7:21 PM, Jagasia, Harsha harsha.jaga...@amd.com wrote:
 We would like to propose changing AVX generic mode tuning to generate
 128-bit AVX instead of 256-bit AVX.

 You indicate a 3% reduction on bulldozer with avx256.
 How does avx128 compare to -mno-avx -msse4.2?
 Will the next AMD generation have a useable avx256?

 I'm not keen on the idea of generic mode being tuned for a single
 processor revision that maybe shouldn't actually be using avx at all.

Btw, it looks like the data is massively skewed by 436.cactusADM.  What are 
the overall numbers if you disregard cactus?  It's also for sure the case 
that the vectorizer cost model has not been touched for avx256 vs. avx128 vs. 
sse, so a more sensible approach would be to look at differentiating things 
there to improve the cactus numbers.

Harsha, did you investigate why avx256 is such a loss for cactus or why it is 
so much of a win for SB?

 I know this thread went unanswered from our end for a while, but we (AMD)
 would really like to re-open this discussion. So here goes.

 We did investigate why cactus is slower in avx-256 mode than avx-128 mode on 
 AMD processors.

 Using the -Ofast flag (with appropriate flags to generate avx-128 or
 avx-256 code) and running with the reference data set, we observe the
 following runtimes on Bulldozer.
                                Runtime    %Diff AVX-256 versus AVX-128
 AVX128                         616s
 AVX256 with store splitting    853s       38%

 Scheduling and predictive commoning are turned off in the compiler for both
 cases, so that the code generated for the avx-128 and avx-256 cases is
 mostly equivalent, i.e. only avx-128 instructions on one side are replaced
 by avx-256 instructions on the other side.

 Looking at the cactus source and oprofile reports, the hottest loop nest is a 
 triple nested loop. The innermost loop of this nest has ~400 lines of Fortran 
 code and takes up 99% of the run time of the benchmark.

 Gcc vectorizes the innermost loop for both the 128- and 256-bit cases. In
 order to vectorize the innermost loop, gcc generates a SIMD scalar prologue
 loop to align the relevant vectors, followed by a SIMD packed avx loop,
 followed by a SIMD scalar epilogue loop to handle what is left after a whole
 multiple of the vectorization factor has been processed.

 Here are the oprofile samples seen in the AVX-128 and AVX-256 case for the 
 innermost Fortran loop's 3 components.
 Oprofile Samples
                            AVX-128    AVX-256-ss    Gap in samples    Gap as % of total runtime
 Total                      153408     214448        61040             38%
 SIMD Vector loop           135653     183074        47421             30%
 SIMD Scalar Prolog loop    3817       10434         6617              4%
 SIMD Scalar Epilog loop    3471       10072         6601              4%

 The avx-256 code is spending 30% more time in the SIMD vector loop than the
 avx-128 code. The code gen appears to be equivalent for this vector loop in
 the 128b and 256b cases, i.e. only avx-128 instructions on one side are
 replaced by avx-256 instructions on the other side. The instruction mix and
 scheduling are the same, except for the spilling and loading of one variable.

 We know this gap is because fewer physical registers are available for
 renaming to the avx-256 code, since our processor loses the upper halves of
 the FP registers for renaming.
 Our entire SIMD pipeline in the processor is 128-bit and we do not have a
 native 256-bit datapath, even in foreseeable future generations, unlike
 Sandybridge/Ivybridge.

 The avx-256 code is spending 8% more time in the SIMD scalar prologue and
 epilogue than the avx-128 code. The code gen is exactly the same for these
 scalar loops in the 128b and 256b cases, i.e. the exact same instruction mix
 and scheduling. The reason for the gap is actually the number of iterations
 that gcc executes in these loops in the two cases.

 This is because gcc is following Sandybridge's recommendation and aligning
 avx-256 vectors to a 32-byte boundary instead of a 16-byte boundary, even on
 Bulldozer.
 The Sandybridge Software Optimization Guide mentions that the optimal memory
 alignment of an AVX 256-bit vector, stored in memory, is 32 bytes.
 The Bulldozer Software Optimization Guide says "Align all packed
 floating-point data on 16-byte boundaries."

 In case of cactus, the relevant double vector has 118 elements that are
 stepped through in unit stride, and the first element handled in the Fortran
 loop is aligned at an address ending in 0x8.

FW: AVX generic mode tuning discussion.

2013-01-07 Thread Jagasia, Harsha
 We would like to propose changing AVX generic mode tuning to 
 generate 128-bit AVX instead of 256-bit AVX.

 You indicate a 3% reduction on bulldozer with avx256.
 How does avx128 compare to -mno-avx -msse4.2?
 Will the next AMD generation have a useable avx256?

 I'm not keen on the idea of generic mode being tuned for a single
 processor revision that maybe shouldn't actually be using avx at all.

Btw, it looks like the data is massively skewed by 436.cactusADM.  What are 
the overall numbers if you disregard cactus?  It's also for sure the case that 
the vectorizer cost model has not been touched for avx256 vs. avx128 vs. sse, 
so a more sensible approach would be to look at differentiating things there 
to improve the cactus numbers. 

Harsha, did you investigate why avx256 is such a loss for cactus or why it is 
so much of a win for SB?

I know this thread went unanswered from our end for a while, but we (AMD)
would really like to re-open this discussion. So here goes.

We did investigate why cactus is slower in avx-256 mode than avx-128 mode on 
AMD processors.

Using the -Ofast flag (with appropriate flags to generate avx-128 or avx-256
code) and running with the reference data set, we observe the following
runtimes on Bulldozer.
                               Runtime    %Diff AVX-256 versus AVX-128
AVX128                         616s
AVX256 with store splitting    853s       38%

Scheduling and predictive commoning are turned off in the compiler for both
cases, so that the code generated for the avx-128 and avx-256 cases is mostly
equivalent, i.e. only avx-128 instructions on one side are replaced by avx-256
instructions on the other side.

Looking at the cactus source and oprofile reports, the hottest loop nest is a 
triple nested loop. The innermost loop of this nest has ~400 lines of Fortran 
code and takes up 99% of the run time of the benchmark. 

Gcc vectorizes the innermost loop for both the 128- and 256-bit cases. In order
to vectorize the innermost loop, gcc generates a SIMD scalar prologue loop to
align the relevant vectors, followed by a SIMD packed avx loop, followed by a
SIMD scalar epilogue loop to handle what is left after a whole multiple of the
vectorization factor has been processed.

Here are the oprofile samples seen in the AVX-128 and AVX-256 case for the 
innermost Fortran loop's 3 components. 
Oprofile Samples
                           AVX-128    AVX-256-ss    Gap in samples    Gap as % of total runtime
Total                      153408     214448        61040             38%
SIMD Vector loop           135653     183074        47421             30%
SIMD Scalar Prolog loop    3817       10434         6617              4%
SIMD Scalar Epilog loop    3471       10072         6601              4%

The avx-256 code is spending 30% more time in the SIMD vector loop than the
avx-128 code. The code gen appears to be equivalent for this vector loop in the
128b and 256b cases, i.e. only avx-128 instructions on one side are replaced by
avx-256 instructions on the other side. The instruction mix and scheduling are
the same, except for the spilling and loading of one variable.

We know this gap is because fewer physical registers are available for renaming
to the avx-256 code, since our processor loses the upper halves of the FP
registers for renaming.
Our entire SIMD pipeline in the processor is 128-bit and we do not have a
native 256-bit datapath, even in foreseeable future generations, unlike
Sandybridge/Ivybridge.

The avx-256 code is spending 8% more time in the SIMD scalar prologue and
epilogue than the avx-128 code. The code gen is exactly the same for these
scalar loops in the 128b and 256b cases, i.e. the exact same instruction mix and
scheduling. The reason for the gap is actually the number of iterations that
gcc executes in these loops in the two cases.

This is because gcc is following Sandybridge's recommendation and aligning
avx-256 vectors to a 32-byte boundary instead of a 16-byte boundary, even on
Bulldozer.
The Sandybridge Software Optimization Guide mentions that the optimal memory
alignment of an AVX 256-bit vector, stored in memory, is 32 bytes.
The Bulldozer Software Optimization Guide says "Align all packed floating-point
data on 16-byte boundaries."

In case of cactus, the relevant double vector has 118 elements that are stepped
through in unit stride, and the first element handled in the Fortran loop is
aligned at an address ending in 0x8.
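
To make the iteration-count gap concrete, the small program below works through
the peel and remainder arithmetic for this case. It is a sketch of the reasoning
only, not code taken from gcc, and it assumes 8-byte doubles, a 118-element
unit-stride access, peeling until the accesses are aligned, and the worst case
for an address ending in 0x8, namely 8 bytes past the required boundary.

#include <stdio.h>

/* Prologue/vector/epilogue trip counts for a unit-stride double loop of n
   elements whose first element lies 'offset' bytes past an 'align'-byte
   boundary, vectorized with factor vf.  Illustrative only.  */
static void trip_counts(const char *name, unsigned align, unsigned vf,
                        unsigned n, unsigned offset)
{
    unsigned peel = (offset % align) ? (align - offset % align) / 8 : 0;
    unsigned body = (n - peel) / vf;        /* packed SIMD iterations    */
    unsigned tail = (n - peel) % vf;        /* scalar epilogue elements  */
    printf("%-20s prologue %u, vector %u, epilogue %u\n", name, peel, body, tail);
}

int main(void)
{
    /* An address ending in 0x8 is 8 bytes past a 16-byte boundary and, in
       the worst case, also only 8 bytes past a 32-byte boundary.  */
    trip_counts("AVX-128 (16B align)", 16, 2, 118, 8);
    trip_counts("AVX-256 (32B align)", 32, 4, 118, 8);
    return 0;
}

This prints 1 prologue + 58 vector + 1 epilogue iterations for the 16-byte case
versus 3 + 28 + 3 for the 32-byte case, i.e. under this worst-case assumption the
avx-256 build runs roughly three times as many scalar iterations of the
~400-line loop body per invocation, consistent with the extra scalar-loop time
reported above.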

RE: AVX generic mode tuning discussion.

2011-11-02 Thread Jagasia, Harsha
We would like to propose changing AVX generic mode tuning to generate
128-bit AVX instead of 256-bit AVX.
  
   You indicate a 3% reduction on bulldozer with avx256.
   How does avx128 compare to -mno-avx -msse4.2?
 
  We see these % differences going from SSE42 to AVX128 to AVX256 on
  Bulldozer with -mtune=generic -Ofast.
  (Positive is improvement, negative is degradation)
 
  Bulldozer:
                        AVX128/SSE42    AVX256/AVX-128
  410.bwaves            -1.4%                   -1.4%
  416.gamess            -1.1%                   0.0%
  433.milc              0.5%                    -2.4%
  434.zeusmp            9.7%                    -2.1%
  435.gromacs           5.1%                    0.5%
  436.cactusADM         8.2%                    -23.8%
  437.leslie3d          8.1%                    0.4%
  444.namd              3.6%                    0.0%
  447.dealII            -1.4%                   -0.4%
  450.soplex            -0.4%                   -0.4%
  453.povray            0.0%                    -1.5%
  454.calculix          15.7%                   -8.3%
  459.GemsFDTD          4.9%                    1.4%
  465.tonto             1.3%                    -0.6%
  470.lbm               0.9%                    0.3%
  481.wrf               7.3%                    -3.6%
  482.sphinx3           5.0%                    -9.8%
  SPECFP                3.8%                    -3.2%
 
   Will the next AMD generation have a useable avx256?
   I'm not keen on the idea of generic mode being tuned
   for a single processor revision that maybe shouldn't
   actually be using avx at all.
 
  We see a substantial gain in several SPECFP benchmarks going from SSE42
  to AVX128 on Bulldozer.
  IMHO, accomplishing even a 5% gain in an individual benchmark takes a
  hardware company several man months.
  The loss with AVX256 for Bulldozer is much more significant than the
  gain for SandyBridge.
  While the general trend in the industry is a move toward AVX256, for
  now we would be disadvantaging Bulldozer with this choice.

  We have several customers who use -mtune=generic, and it is the default
  unless a user explicitly overrides it with -mtune=native. They are the
  ones who want to experiment with the latest ISA using gcc, but want to
  keep their ISA selection and tuning agnostic on x86/64. IMHO, it is with
  these customers in mind that generic was introduced in the first place.

  Since stage 1 closure is around the corner, just wanted to ping to
  see if the maintainers have made up their minds on this one.
  AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes
  out pretty much all of that gain in generic mode.
  Until there is convergence on AVX-256 for x86/64, we would like to
  propose having generic generate avx-128 by default, with users overriding
  to avx-256 manually when it is known to benefit performance.
 
 Did somebody spend the time analyzing why CactusADM shows so much of a
 difference?
 With the recent improvements in vectorizing for AVX, did you
 re-do the measurements with a recent trunk?

 I don't think disabling avx-256 by default is a good idea until we
 understand why these numbers happen and are convinced we cannot fix
 this by proper cost modeling.

We have observed cases where AVX 256-bit code is slower than AVX 128-bit code
on Bulldozer. This is because internally the front end, data paths, etc. for
Bulldozer are designed for optimal AVX 128-bit execution. Throwing densely
packed 256-bit code at the pipeline can congest the front end, causing stalls
and hence slowdowns. We expect the behavior of cactus, calculix and sphinx,
which are the three benchmarks with the biggest avx-256 gaps, to be in the same
vein. In general, the hardware design engineers recommend running AVX 128-bit
code on Bulldozer. Given the underlying hardware design, software tuning can't
really change the results here. Any further analysis of cactus would be a cycle
sink at our end and we may not even be able to discuss the details on a public
mailing list. x86/64 has not yet converged on avx-256 and generic mode should
reflect that.

Posting the re-measurements on trunk for cactus, calculix and sphinx on 
Bulldozer:
                AVX128/SSE42    AVX256/AVX-128
436.cactusADM   10%             -30%
454.calculix    14.7%           -6%
482.sphinx3     7%              -9%

All positive % above are improvements, all negative % are degradations.

I will post re-measurements for all of Spec with latest trunk as soon as I have 
them.

Thoughts?

Thanks,
Harsha




Re: AVX generic mode tuning discussion.

2011-11-02 Thread Richard Guenther
On Wed, Nov 2, 2011 at 5:57 PM, Jagasia, Harsha harsha.jaga...@amd.com wrote:
We would like to propose changing AVX generic mode tuning to generate
128-bit AVX instead of 256-bit AVX.
  
   You indicate a 3% reduction on bulldozer with avx256.
   How does avx128 compare to -mno-avx -msse4.2?
 
  We see these % differences going from SSE42 to AVX128 to AVX256 on
  Bulldozer with -mtune=generic -Ofast.
  (Positive is improvement, negative is degradation)
 
  Bulldozer:
                        AVX128/SSE42    AVX256/AVX-128
  410.bwaves            -1.4%                   -1.4%
  416.gamess            -1.1%                   0.0%
  433.milc              0.5%                    -2.4%
  434.zeusmp            9.7%                    -2.1%
  435.gromacs           5.1%                    0.5%
  436.cactusADM         8.2%                    -23.8%
  437.leslie3d          8.1%                    0.4%
  444.namd              3.6%                    0.0%
  447.dealII            -1.4%                   -0.4%
  450.soplex            -0.4%                   -0.4%
  453.povray            0.0%                    -1.5%
  454.calculix          15.7%                   -8.3%
  459.GemsFDTD          4.9%                    1.4%
  465.tonto             1.3%                    -0.6%
  470.lbm               0.9%                    0.3%
  481.wrf               7.3%                    -3.6%
  482.sphinx3           5.0%                    -9.8%
  SPECFP                3.8%                    -3.2%
 
   Will the next AMD generation have a useable avx256?
   I'm not keen on the idea of generic mode being tuned
   for a single processor revision that maybe shouldn't
   actually be using avx at all.
 
  We see a substantial gain in several SPECFP benchmarks going from SSE42
  to AVX128 on Bulldozer.
  IMHO, accomplishing even a 5% gain in an individual benchmark takes a
  hardware company several man months.
  The loss with AVX256 for Bulldozer is much more significant than the
  gain for SandyBridge.
  While the general trend in the industry is a move toward AVX256, for
  now we would be disadvantaging Bulldozer with this choice.

  We have several customers who use -mtune=generic, and it is the default
  unless a user explicitly overrides it with -mtune=native. They are the
  ones who want to experiment with the latest ISA using gcc, but want to
  keep their ISA selection and tuning agnostic on x86/64. IMHO, it is with
  these customers in mind that generic was introduced in the first place.

  Since stage 1 closure is around the corner, just wanted to ping to
  see if the maintainers have made up their minds on this one.
  AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes
  out pretty much all of that gain in generic mode.
  Until there is convergence on AVX-256 for x86/64, we would like to
  propose having generic generate avx-128 by default, with users overriding
  to avx-256 manually when it is known to benefit performance.

 Did somebody spend the time analyzing why CactusADM shows so much of a
 difference?
 With the recent improvements in vectorizing for AVX, did you
 re-do the measurements with a recent trunk?

 I don't think disabling avx-256 by default is a good idea until we
 understand why these numbers happen and are convinced we cannot fix
 this by proper cost modeling.

 We have observed cases where AVX 256-bit code is slower than AVX 128-bit code
 on Bulldozer. This is because internally the front end, data paths, etc. for
 Bulldozer are designed for optimal AVX 128-bit execution. Throwing densely
 packed 256-bit code at the pipeline can congest the front end, causing stalls
 and hence slowdowns. We expect the behavior of cactus, calculix and sphinx,
 which are the three benchmarks with the biggest avx-256 gaps, to be in the
 same vein. In general, the hardware design engineers recommend running AVX
 128-bit code on Bulldozer. Given the underlying hardware design, software
 tuning can't really change the results here. Any further analysis of cactus
 would be a cycle sink at our end and we may not even be able to discuss the
 details on a public mailing list. x86/64 has not yet converged on avx-256 and
 generic mode should reflect that.

Well, generic hasn't converged on AVX at all.  Cost modeling can deal
with code density just fine - are there any differences between the code
density issues of, say, loads vs. stores vs. arithmetic?  I specifically
ask about analysis because AVX-256 has instruction-set issues for certain
patterns the vectorizer generates, and the cost model currently does not
reflect these at all.
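
To illustrate the kind of differentiation a cost model could make here, the toy
comparison below uses made-up per-operation weights; it is not gcc's cost-model
interface. It only sketches how target-specific weights (256-bit operations that
internally split into two 128-bit halves versus ones that do not), together with
the larger peel and remainder loops, could flip the 128-bit vs. 256-bit decision.

#include <stdio.h>

/* Toy per-operation weights; not gcc's cost model.  */
struct vec_cost { int load, store, arith, scalar_iter; };

/* Estimated cost of one invocation of a vectorized loop over n_elems
   elements: packed body plus scalar peel/remainder iterations.  */
static int loop_cost(struct vec_cost c, int vf, int n_elems,
                     int loads, int stores, int arith, int peel)
{
    int body_iters = (n_elems - peel) / vf;
    int body = body_iters * (loads * c.load + stores * c.store + arith * c.arith);
    int scalar = (peel + (n_elems - peel) % vf) * c.scalar_iter;
    return body + scalar;
}

int main(void)
{
    /* Hypothetical weights: xmm ops cost 1; ymm ops cost 2 on a core that
       double-pumps them and 1 on a native 256-bit core; a scalar iteration
       of the loop body costs 8.  */
    struct vec_cost xmm       = { 1, 1, 1, 8 };
    struct vec_cost ymm_split = { 2, 2, 2, 8 };
    struct vec_cost ymm_wide  = { 1, 1, 1, 8 };

    /* 118 doubles, 3 loads / 1 store / 4 arithmetic ops per element,
       peel of 1 element for 128-bit vs 3 elements for 256-bit.  */
    printf("avx128           : %d\n", loop_cost(xmm,       2, 118, 3, 1, 4, 1));
    printf("avx256, split ymm: %d\n", loop_cost(ymm_split, 4, 118, 3, 1, 4, 3));
    printf("avx256, wide ymm : %d\n", loop_cost(ymm_wide,  4, 118, 3, 1, 4, 3));
    return 0;
}

With these made-up numbers the model would pick 128-bit vectors for the
double-pumping core and 256-bit vectors for the native 256-bit core, which is
the kind of target-specific differentiation being asked for above.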

Richard.

 Posting the re-measurements on trunk for cactus, calculix and sphinx on 
 Bulldozer:
                AVX128/SSE42    AVX256/AVX-128
 436.cactusADM   10%                     -30%
 454.calculix    14.7%                   -6%
 482.sphinx3     7%                      -9%

 All positive % above are improvements, all negative % are degradations.

 I will post re-measurements for all of Spec with latest trunk as soon as I
 have them.

Re: AVX generic mode tuning discussion.

2011-11-01 Thread Richard Guenther
On Mon, Oct 31, 2011 at 9:36 PM, Jagasia, Harsha harsha.jaga...@amd.com wrote:
   We would like to propose changing AVX generic mode tuning to generate
   128-bit AVX instead of 256-bit AVX.
 
  You indicate a 3% reduction on bulldozer with avx256.
  How does avx128 compare to -mno-avx -msse4.2?

 We see these % differences going from SSE42 to AVX128 to AVX256 on
 Bulldozer with -mtune=generic -Ofast.
 (Positive is improvement, negative is degradation)

 Bulldozer:
                       AVX128/SSE42    AVX256/AVX-128
 410.bwaves            -1.4%                   -1.4%
 416.gamess            -1.1%                   0.0%
 433.milc              0.5%                    -2.4%
 434.zeusmp            9.7%                    -2.1%
 435.gromacs           5.1%                    0.5%
 436.cactusADM         8.2%                    -23.8%
 437.leslie3d          8.1%                    0.4%
 444.namd              3.6%                    0.0%
 447.dealII            -1.4%                   -0.4%
 450.soplex            -0.4%                   -0.4%
 453.povray            0.0%                    -1.5%
 454.calculix          15.7%                   -8.3%
 459.GemsFDTD          4.9%                    1.4%
 465.tonto             1.3%                    -0.6%
 470.lbm               0.9%                    0.3%
 481.wrf               7.3%                    -3.6%
 482.sphinx3           5.0%                    -9.8%
 SPECFP                3.8%                    -3.2%

  Will the next AMD generation have a useable avx256?
  I'm not keen on the idea of generic mode being tuned
  for a single processor revision that maybe shouldn't
  actually be using avx at all.

 We see a substantial gain in several SPECFP benchmarks going from SSE42
 to AVX128 on Bulldozer.
 IMHO, accomplishing even a 5% gain in an individual benchmark takes a
 hardware company several man months.
 The loss with AVX256 for Bulldozer is much more significant than the
 gain for SandyBridge.
 While the general trend in the industry is a move toward AVX256, for
 now we would be disadvantaging Bulldozer with this choice.

 We have several customers who use -mtune=generic, and it is the default
 unless a user explicitly overrides it with -mtune=native. They are the
 ones who want to experiment with the latest ISA using gcc, but want to keep
 their ISA selection and tuning agnostic on x86/64. IMHO, it is with
 these customers in mind that generic was introduced in the first place.

 Since stage 1 closure is around the corner, just wanted to ping to see if the 
 maintainers have made up their mind on this one.
 AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes out 
 pretty much all of that gain in generic mode.
 Until there is convergence on AVX-256 for x86/64, we would like to propose
 having generic generate avx-128 by default, with users overriding to
 avx-256 manually when it is known to benefit performance.

Did somebody spend the time analyzing why CactusADM shows so much of a
difference?  With the recent improvements in vectorizing for AVX, did you
re-do the measurements with a recent trunk?

I don't think disabling avx-256 by default is a good idea until we
understand why these numbers happen and are convinced we cannot fix
this by proper cost modeling.

Richard.

 Thanks,
 Harsha




RE: AVX generic mode tuning discussion.

2011-10-31 Thread Jagasia, Harsha
   We would like to propose changing AVX generic mode tuning to generate
   128-bit AVX instead of 256-bit AVX.
 
  You indicate a 3% reduction on bulldozer with avx256.
  How does avx128 compare to -mno-avx -msse4.2?
 
 We see these % differences going from SSE42 to AVX128 to AVX256 on
 Bulldozer with -mtune=generic -Ofast.
 (Positive is improvement, negative is degradation)
 
 Bulldozer:
                 AVX128/SSE42    AVX256/AVX-128
 410.bwaves      -1.4%           -1.4%
 416.gamess      -1.1%           0.0%
 433.milc        0.5%            -2.4%
 434.zeusmp      9.7%            -2.1%
 435.gromacs     5.1%            0.5%
 436.cactusADM   8.2%            -23.8%
 437.leslie3d    8.1%            0.4%
 444.namd        3.6%            0.0%
 447.dealII      -1.4%           -0.4%
 450.soplex      -0.4%           -0.4%
 453.povray      0.0%            -1.5%
 454.calculix    15.7%           -8.3%
 459.GemsFDTD    4.9%            1.4%
 465.tonto       1.3%            -0.6%
 470.lbm         0.9%            0.3%
 481.wrf         7.3%            -3.6%
 482.sphinx3     5.0%            -9.8%
 SPECFP          3.8%            -3.2%
 
  Will the next AMD generation have a useable avx256?
  I'm not keen on the idea of generic mode being tuned
  for a single processor revision that maybe shouldn't
  actually be using avx at all.
 
 We see a substantial gain in several SPECFP benchmarks going from SSE42
 to AVX128 on Bulldozer.
 IMHO, accomplishing even a 5% gain in an individual benchmark takes a
 hardware company several man months.
 The loss with AVX256 for Bulldozer is much more significant than the
 gain for SandyBridge.
 While the general trend in the industry is a move toward AVX256, for
 now we would be disadvantaging Bulldozer with this choice.
 
 We have several customers who use -mtune=generic, and it is the default
 unless a user explicitly overrides it with -mtune=native. They are the
 ones who want to experiment with the latest ISA using gcc, but want to keep
 their ISA selection and tuning agnostic on x86/64. IMHO, it is with
 these customers in mind that generic was introduced in the first place.

Since stage 1 closure is around the corner, just wanted to ping to see if the 
maintainers have made up their mind on this one.
AVX-128 is an improvement over SSE42 for Bulldozer and AVX-256 wipes out pretty 
much all of that gain in generic mode.
Until there is convergence on AVX-256 for x86/64, we would like to propose
having generic generate avx-128 by default, with users overriding to avx-256
manually when it is known to benefit performance.

Thanks,
Harsha



RE: AVX generic mode tuning discussion.

2011-07-21 Thread Jagasia, Harsha
 On 07/12/2011 02:22 PM, harsha.jaga...@amd.com wrote:
  We would like to propose changing AVX generic mode tuning to generate
 128-bit
  AVX instead of 256-bit AVX.
 
 You indicate a 3% reduction on bulldozer with avx256.
 How does avx128 compare to -mno-avx -msse4.2?

We see these % differences going from SSE42 to AVX128 to AVX256 on Bulldozer 
with -mtune=generic -Ofast.
(Positive is improvement, negative is degradation)

Bulldozer:
                AVX128/SSE42    AVX256/AVX-128
410.bwaves      -1.4%           -1.4%
416.gamess      -1.1%           0.0%
433.milc        0.5%            -2.4%
434.zeusmp      9.7%            -2.1%
435.gromacs     5.1%            0.5%
436.cactusADM   8.2%            -23.8%
437.leslie3d    8.1%            0.4%
444.namd        3.6%            0.0%
447.dealII      -1.4%           -0.4%
450.soplex      -0.4%           -0.4%
453.povray      0.0%            -1.5%
454.calculix    15.7%           -8.3%
459.GemsFDTD    4.9%            1.4%
465.tonto       1.3%            -0.6%
470.lbm         0.9%            0.3%
481.wrf         7.3%            -3.6%
482.sphinx3     5.0%            -9.8%
SPECFP          3.8%            -3.2%

 Will the next AMD generation have a useable avx256?
 I'm not keen on the idea of generic mode being tuned
 for a single processor revision that maybe shouldn't
 actually be using avx at all.

We see a substantial gain in several SPECFP benchmarks going from SSE42 to 
AVX128 on Bulldozer.
IMHO, accomplishing even a 5% gain in an individual benchmark takes a hardware 
company several man months.
The loss with AVX256 for Bulldozer is much more significant than the gain for 
SandyBridge.
While the general trend in the industry is a move toward AVX256, for now we 
would be disadvantaging Bulldozer with this choice.

We have several customers who use -mtune=generic, and it is the default unless
a user explicitly overrides it with -mtune=native. They are the ones who want
to experiment with the latest ISA using gcc, but want to keep their ISA
selection and tuning agnostic on x86/64. IMHO, it is with these customers in
mind that generic was introduced in the first place.

Thanks,
Harsha



RE: AVX generic mode tuning discussion.

2011-07-21 Thread Jagasia, Harsha
  We would like to propose changing AVX generic mode tuning to generate
  128-bit AVX instead of 256-bit AVX.
 
  You indicate a 3% reduction on bulldozer with avx256.
  How does avx128 compare to -mno-avx -msse4.2?
  Will the next AMD generation have a useable avx256?
 
  I'm not keen on the idea of generic mode being tuned
  for a single processor revision that maybe shouldn't
  actually be using avx at all.
 
 Btw, it looks like the data is massively skewed by
 436.cactusADM.  What are the overall numbers if you
 disregard cactus?  

Disregarding cactus, these are the cumulative SpecFP scores we see.

On Bulldozer:
        AVX256/AVX128
SPECFP  -1.8%

On SandyBridge:
        AVX256/AVX128
SPECFP  -0.15%

 It's also for sure the case that the vectorizer
 cost model has not been touched for avx256 vs. avx128 vs. sse,
 so a more sensible approach would be to look at differentiating
 things there to improve the cactus numbers.  

I am not sure how much the vectorizer cost model can help here.
The cost model can decide whether to vectorize and/or what vectorization factor 
to use.
But in generic mode, that decision has to be processor family neutral anyway.

 Harsha, did you
 investigate why avx256 is such a loss for cactus or why it is
 so much of a win for SB?

We are planning to investigate cactus and other cases to understand the reasons
behind these observations on Bulldozer better, but disregarding cactus, there
appear to be no significant gains on Sandybridge with AVX256 over AVX128 either.

Thanks,
Harsha




Re: AVX generic mode tuning discussion.

2011-07-13 Thread Richard Guenther
On Tue, Jul 12, 2011 at 11:56 PM, Richard Henderson r...@redhat.com wrote:
 On 07/12/2011 02:22 PM, harsha.jaga...@amd.com wrote:
 We would like to propose changing AVX generic mode tuning to generate 128-bit
 AVX instead of 256-bit AVX.

 You indicate a 3% reduction on bulldozer with avx256.
 How does avx128 compare to -mno-avx -msse4.2?
 Will the next AMD generation have a useable avx256?

 I'm not keen on the idea of generic mode being tuned
 for a single processor revision that maybe shouldn't
 actually be using avx at all.

Btw, it looks like the data is massively skewed by
436.cactusADM.  What are the overall numbers if you
disregard cactus?  It's also for sure the case that the vectorizer
cost model has not been touched for avx256 vs. avx128 vs. sse,
so a more sensible approach would be to look at differentiating
things there to improve the cactus numbers.  Harsha, did you
investigate why avx256 is such a loss for cactus or why it is
so much of a win for SB?

I suppose generic tuning is of less importance for AVX as
people need to enable that manually anyway (and will possibly
do so only via means of -march=native).

Thanks,
Richard.


 r~



Re: AVX generic mode tuning discussion.

2011-07-13 Thread Jakub Jelinek
On Wed, Jul 13, 2011 at 10:42:41AM +0200, Richard Guenther wrote:
 I suppose generic tuning is of less importance for AVX as
 people need to enable that manually anyway (and will possibly
 do so only via means of -march=native).

Yeah, but if somebody does compile with -mavx -mtune=generic,
I'd expect the intent is that he wants the fastest code not just on the current
generation of CPUs, but on the next few following ones, and I'd say that
being able to use twice as big a vectorization factor ought to be a win in
most cases if the cost model gets it right.  If not for the vectorization
factor doubling, what would be the reasons why somebody would compile
code with -mavx -mtune=generic and rule out support for many recent chips?
Yeah, there are the 3-operand forms, and such code can avoid the penalty when
mixed with AVX256 code, but would that be a strong enough reason to lose the
support of most of the recent CPUs?  When targeting just a particular CPU
and using -march= with a CPU which already includes AVX, -mtune=generic probably
doesn't make much sense; you probably want -march=native and you are
optimizing for the CPU you have.

Jakub


AVX generic mode tuning discussion.

2011-07-12 Thread harsha.jagasia
We would like to propose changing AVX generic mode tuning to generate 128-bit
AVX instead of 256-bit AVX. As per H.J.'s suggestion, we have reviewed the
various tuning choices made for generic mode with respect to AMD's upcoming
Bulldozer processor. At this moment, this is the most significant change we
have to propose. While we are willing to re-engineer generic mode, this
feature needs immediate discussion since the performance impact on Bulldozer
is significant.

Here is the relative CPU2006 performance data we have gathered using gcc on AMD
Bulldozer (BD) and Intel Sandybridge (SB) machines with -Ofast -mtune=generic
-mavx.

%gain/loss avx256 vs avx128
(negative % indicates loss,
positive % indicates gain)

                AMD BD      Intel SB
410.bwaves      -2.34       -1.52
416.gamess      -1.11       -0.30
433.milc        0.47        -1.75
434.zeusmp      -3.61       0.68
435.gromacs     -0.54       -0.38
436.cactusADM   -23.56      21.49
437.leslie3d    -0.44       1.56
444.namd        0.00        0.00
447.dealII      -0.36       -0.23
450.soplex      -0.43       -0.29
453.povray      0.50        3.63
454.calculix    -8.29       1.38
459.GemsFDTD    2.37        -1.54
465.tonto       0.00        0.00
470.lbm         0.00        0.21
481.wrf         -4.80       0.00
482.sphinx3     -10.20      -3.65
SpecFP          -3.29       1.01

400.perlbench   0.93        1.47
401.bzip2       0.60        0.00
403.gcc         0.00        0.00
429.mcf         0.00        -0.36
445.gobmk       -1.03       0.37
456.hmmer       -0.64       0.38
458.sjeng       1.74        0.00
462.libquantum  0.31        0.00
464.h264ref     0.00        0.00
471.omnetpp     -1.27       0.00
473.astar       0.00        0.46
483.xalancbmk   0.51        0.00
SpecINT         0.09        0.19

As per the data, the 1% performance gain for Intel Sandybridge on SpecFP is
eclipsed by a 3% degradation for AMD Bulldozer.

For the data above, generic mode splits both 256-bit misaligned loads and
stores, as is currently the case in trunk. 
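
For readers unfamiliar with what the splitting refers to, the transformation is
roughly the one below, shown with AVX intrinsics purely as an illustration; gcc
performs it on the code it generates, and the pointer names here are hypothetical.

#include <immintrin.h>

/* Unsplit: a single 256-bit unaligned load and store.  */
__m256d load256_whole(const double *p)       { return _mm256_loadu_pd(p); }
void    store256_whole(double *p, __m256d v) { _mm256_storeu_pd(p, v); }

/* Split: the same accesses done as two 128-bit halves, which is what
   generic mode emits for misaligned 256-bit loads and stores.  */
__m256d load256_split(const double *p)
{
    __m128d lo = _mm_loadu_pd(p);
    __m128d hi = _mm_loadu_pd(p + 2);
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
}

void store256_split(double *p, __m256d v)
{
    _mm_storeu_pd(p,     _mm256_castpd256_pd128(v));
    _mm_storeu_pd(p + 2, _mm256_extractf128_pd(v, 1));
}

Compiled with -mavx, the split variants touch the same 32 bytes but keep every
individual memory access 128 bits wide.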

Even if we disable 256-bit misaligned load splitting, AVX 256-bit performance
improves only by ~1.4% on SpecFP for AMD Bulldozer, while on Intel Sandybridge
it drops by 0.12%. In that configuration, comparing AVX 256 to AVX 128 shows a
cumulative 0.9% gain for Intel Sandybridge versus a 1.9% loss for AMD
Bulldozer, so AVX 256 is still not a fair choice for generic mode.

Please provide thoughts. It would be great if HJ could verify the Intel
Sandybridge data.

Thanks,
Harsha




Re: AVX generic mode tuning discussion.

2011-07-12 Thread Richard Henderson
On 07/12/2011 02:22 PM, harsha.jaga...@amd.com wrote:
 We would like to propose changing AVX generic mode tuning to generate 128-bit
 AVX instead of 256-bit AVX.

You indicate a 3% reduction on bulldozer with avx256.
How does avx128 compare to -mno-avx -msse4.2?
Will the next AMD generation have a useable avx256?

I'm not keen on the idea of generic mode being tuned
for a single processor revision that maybe shouldn't
actually be using avx at all.


r~