RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost

2015-04-12 Thread Gopalasubramanian, Ganesh
Note that before the fixes for PR64909 the epilogue/prologue loops had very 
large costs associated due to a bug in the cost model implementation.  After 
the fix their cost is reasonable but the cost of the extra jumps is way 
under-accounted for due to the numbers for cond_taken_branch_cost and 
cond_not_taken_branch_cost.
 The proposes match mitigates that somewhat.

Richard! The patch is good. We are done with our benchmarking and found no 
regressions.

 How did you arrive at the original cost model?

The original cost model as you suspect is not based on architecture alone. 
Those are the numbers arrived at by analyzing benchmarks and the cost model 
bugs then. These initial numbers were copied for subsequent architectures too.
Cost assignments saying scalar_stmt_cost = 6 and scalar load_cost = 4 
doesn't make sense at all. We will have a look into it.

Regards
Ganesh

-Original Message-
From: Richard Biener [mailto:rguent...@suse.de] 
Sent: Wednesday, April 08, 2015 1:08 PM
To: Gopalasubramanian, Ganesh
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org
Subject: RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost

On Wed, 8 Apr 2015, Gopalasubramanian, Ganesh wrote:

  I have added a person from AMD to comment on the decision.
  Otherwise, the patch looks OK, but please wait a couple of days for 
  possible comments.
 
 Thank you Uros!
 I am checking the changes with few tests and benchmarking them.
 Please wait for a couple of days.

Note that before the fixes for PR64909 the epilogue/prologue loops had very 
large costs associated due to a bug in the cost model implementation.  After 
the fix their cost is reasonable but the cost of the extra jumps is way 
under-accounted for due to the numbers for cond_taken_branch_cost and 
cond_not_taken_branch_cost.
The proposes match mitigates that somewhat.

How did you arrive at the original cost model?

Thanks,
Richard.

--
Richard Biener rguent...@suse.de
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton HRB 21284 (AG Nuernberg)


RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost

2015-04-08 Thread Gopalasubramanian, Ganesh
 I have added a person from AMD to comment on the decision.
 Otherwise, the patch looks OK, but please wait a couple of days for possible 
 comments.

Thank you Uros!
I am checking the changes with few tests and benchmarking them.
Please wait for a couple of days.

-Ganesh


RE: [PATCH] Rename gimple_build_assign_with_ops to gimple_build_assign and swap the first two arguments of it

2014-12-01 Thread Gopalasubramanian, Ganesh
The following patch implements that.  Bootstrapped/regtested on x86_64-linux
and i686-linux, ok for trunk?

Our aarch64 build also breaks as mentioned in 
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00119.html

Regards
Ganesh


RE: [PATCH, aarch64] Add prefetch support

2014-11-30 Thread Gopalasubramanian, Ganesh
 There's no point in the buffer or the sprintf.
 The text is short enough to repeat whole pattern in the array:

Updated the patch for the above suggestions.
Is it ok for upstream?

Regards
Ganesh


prefetch.diff
Description: prefetch.diff


RE: [PATCH, aarch64] Add prefetch support

2014-11-30 Thread Gopalasubramanian, Ganesh
Please ignore the previous patch sent. The attachment was wrong.

 There's no point in the buffer or the sprintf.
 The text is short enough to repeat whole pattern in the array:

Updated the patch for the above suggestions.
make -k check RUNTESTFLAGS=execute.exp compile.exp dg.exp passes.

Is it ok for upstream?

Regards
Ganesh


prefetch.diff
Description: prefetch.diff


FW: [PATCH, aarch64] Add prefetch support

2014-11-10 Thread Gopalasubramanian, Ganesh
PING! I am worried if it goes in stage-1.

-Original Message-
From: Gopalasubramanian, Ganesh 
Sent: Thursday, October 30, 2014 2:24 PM
To: gcc-patches@gcc.gnu.org
Subject: [PATCH, aarch64] Add prefetch support

Hi,

Below is the patch that implements prefetching support.

This patch has been already discussed on
a) https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html
b) https://gcc.gnu.org/ml/gcc-patches/2014-07/msg00612.html

I have not added a test as there are ample tests in compile and execute suites.
make -k check passes. Ok for trunk?

Changelog:
2014-10-30  Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com

    * config/aarch64/aarch64.md (define_insn prefetch): New.
    * config/arm/types.md (define_attr type): Add prefetch.

Regards
Ganesh

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md 
index 74b554e..12a3f170 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -320,6 +320,38 @@
   [(set_attr type no_insn)]
)

+
+(define_insn prefetch
+  [(prefetch (match_operand:DI 0 address_operand r)
+    (match_operand:QI 1 const_int_operand )
+    (match_operand:QI 2 const_int_operand ))]
+  
+  *
+{
+  const char * pftype[2][10]
+    = { {\PLDL1STRM\, \PLDL3KEEP\, \PLDL2KEEP\, \PLDL1KEEP\},
+   {\PSTL1STRM\, \PSTL3KEEP\, \PSTL2KEEP\, \PSTL1KEEP\},
+  };
+
+  int locality = INTVAL (operands[2]);
+  char pattern[100];
+
+  gcc_assert (IN_RANGE (locality, 0, 3));
+
+  strcpy (pattern, \prfm\\t\);
+  strcat (pattern, (const char*)pftype[INTVAL(operands[1])][locality]);
+  strcat (pattern, \, %a0\);
+
+  output_asm_insn (pattern,
+   operands);
+
+  return \\;
+
+}
+  [(set_attr type prefetch)]
+)
+
(define_insn trap
   [(trap_if (const_int 1) (const_int 8))]
   
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index 
c1151f5..8b4b7a1 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -118,6 +118,7 @@
; mvn_shift_reg  inverting move instruction, shifted operand by a register.
; no_insn    an insn which does not represent an instruction in the ;   
 final output, thus having no impact on scheduling.
+; prefetch   a prefetch instruction
; rbit   reverse bits.
; rev    reverse bytes.
; sdiv   signed division.
@@ -556,6 +557,7 @@
   call,\
   clz,\
   no_insn,\
+  prefetch,\
   csel,\
   crc,\


[PATCH, aarch64] Add prefetch support

2014-10-30 Thread Gopalasubramanian, Ganesh
Hi,

Below is the patch that implements prefetching support.

This patch has been already discussed on 
a) https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html
b) https://gcc.gnu.org/ml/gcc-patches/2014-07/msg00612.html

I have not added a test as there are ample tests in compile and execute suites.
make -k check passes. Ok for trunk?

Changelog:
2014-10-30  Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com

    * config/aarch64/aarch64.md (define_insn prefetch): New.
    * config/arm/types.md (define_attr type): Add prefetch.

Regards
Ganesh

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 74b554e..12a3f170 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -320,6 +320,38 @@
   [(set_attr type no_insn)]
)

+
+(define_insn prefetch
+  [(prefetch (match_operand:DI 0 address_operand r)
+    (match_operand:QI 1 const_int_operand )
+    (match_operand:QI 2 const_int_operand ))]
+  
+  *
+{
+  const char * pftype[2][10]
+    = { {\PLDL1STRM\, \PLDL3KEEP\, \PLDL2KEEP\, \PLDL1KEEP\},
+   {\PSTL1STRM\, \PSTL3KEEP\, \PSTL2KEEP\, \PSTL1KEEP\},
+  };
+
+  int locality = INTVAL (operands[2]);
+  char pattern[100];
+
+  gcc_assert (IN_RANGE (locality, 0, 3));
+
+  strcpy (pattern, \prfm\\t\);
+  strcat (pattern, (const char*)pftype[INTVAL(operands[1])][locality]);
+  strcat (pattern, \, %a0\);
+
+  output_asm_insn (pattern,
+   operands);
+
+  return \\;
+
+}
+  [(set_attr type prefetch)]
+)
+
(define_insn trap
   [(trap_if (const_int 1) (const_int 8))]
   
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
index c1151f5..8b4b7a1 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -118,6 +118,7 @@
; mvn_shift_reg  inverting move instruction, shifted operand by a register.
; no_insn    an insn which does not represent an instruction in the
;    final output, thus having no impact on scheduling.
+; prefetch   a prefetch instruction
; rbit   reverse bits.
; rev    reverse bytes.
; sdiv   signed division.
@@ -556,6 +557,7 @@
   call,\
   clz,\
   no_insn,\
+  prefetch,\
   csel,\
   crc,\


RE: RFA: another patch to fix PR61360

2014-09-24 Thread Gopalasubramanian, Ganesh
The r-x alternative results in vector decoding on amdfam10. This is 
AMD-speak for microcoded instructions, and AMD optimization manual strongly 
recommends avoiding them. I have CC'd Ganesh, maybe he can provide more 
relevant data on the performance impact.

Thanks Uros!

Yes, the AMD SWOG recommends precisely what Uros mentions.
snip from SWOG for BD
When moving data from a GPR to an XMM register, use separate store and load 
instructions to move
the data first from the source register to a temporary location in memory and 
then from memory into
the destination register
/snip

This is listed as an optimization too. This holds good for all amdfam10 and BD  
family processors. 
I have to dig through the performance numbers will try to get them.

Regards
Ganesh


[PATCH, i386] PR61360: Do not update enabled attribute during lra and reload passes

2014-08-22 Thread Gopalasubramanian, Ganesh
This patch fixes PR 61360.

The attribute enabled should actually be used enable/disable alternative 
based on sub-targets.
In this pattern, it gets used across passes too. 
However, modifying this attribute in LRA pass is not something it is meant for.
This patch allows enabling/disabling the attribute when optimizing for size, 
but not during lra pass or reload pass.

Bootstrap passes.
OK for upstream?

Regards
Ganesh

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6d91da0..3775f6e 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,4 +1,10 @@
-2014-08-22  David Malcolm  dmalc...@redhat.com
+2014-08-22 Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
+
+   PR 61360
+   * config/i386/i386.md (*floatSWI48:modeMODEF:mode2_sse):
+   Do not modify enabled attribute during LRA pass.
+
+014-08-22  David Malcolm  dmalc...@redhat.com

* cprop.c (struct occr): Strengthen field insn from rtx to
rtx_insn *.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 8e74eab..de2ecf0 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4795,10 +4795,10 @@
   /* ??? For sched1 we need constrain_operands to be able to
  select an alternative.  Leave this enabled before RA.  */
   (symbol_ref TARGET_INTER_UNIT_CONVERSIONS
-   || optimize_function_for_size_p (cfun)
-   || !(reload_completed
-|| reload_in_progress
-|| lra_in_progress))
+   || (optimize_function_for_size_p (cfun)
+!(reload_completed
+|| reload_in_progress
+|| lra_in_progress)))
]
(symbol_ref true)))
])


RE: [PATCH, i386] Remove use of vpmacsdql instruction from multiplication.

2014-08-11 Thread Gopalasubramanian, Ganesh
Hi Uros!

  +2014-06-10  Ganesh Gopalasubramanian 
  +ganesh.gopalasubraman...@amd.com
  +
  +   * config/i386/i386.c (ix86_expand_sse2_mulvxdi3): Issue 
  +instructions vpmuludq and vpaddq instead of vpmacsdql for 
  +handling 32-bit multiplication.
 

 OK for mainline and release branches.

I would like to backport the above patch for 4.9. Is it OK?

Regards
Ganesh


RE: [PATCH, i386] Add RDRND and MOVBE for AMD bdver4

2014-08-08 Thread Gopalasubramanian, Ganesh
 OK for mainline.

Thanks Uros. Committed to revision 213572
I would like to backport to 4.9 branch too.
Is it OK?

- Ganesh


[PATCH, i386] Add RDRND and MOVBE for AMD bdver4

2014-08-04 Thread Gopalasubramanian, Ganesh
Below patch adds PTA_RDRND and PTA_MOVBE for bdver4.

Bootstrap passes.

Ok for upstream?

Regards
Ganesh

Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 213568)
+++ gcc/ChangeLog   (working copy)
@@ -24,6 +24,11 @@

2014-08-04 Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

+* gcc/config/i386/i386.c (ix86_option_override_internal): Add
+   PTA_RDRND and PTA_MOVBE for bdver4.
+
+2014-08-04 Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
+
* config/i386/driver-i386.c (host_detect_local_cpu): Handle AMD's 
extended
family information. Handle BTVER2 cpu with cpuid family value.

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 213568)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3267,12 +3267,13 @@ ix86_option_override_internal (bool main
| PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE
| PTA_XSAVEOPT | PTA_FSGSBASE},
  {bdver4, PROCESSOR_BDVER4, CPU_BDVER4,
-PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
-| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2
+   PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
+   | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
+   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2
| PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_BMI2
| PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR
-   | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE},
+   | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE | PTA_RDRND
+   | PTA_MOVBE},
   {btver1, PROCESSOR_BTVER1, CPU_GENERIC,
PTA_64BIT | PTA_MMX |  PTA_SSE  | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW


RE: [PATCH, i386] Handle extended family cpuid info for AMD

2014-08-01 Thread Gopalasubramanian, Ganesh
Uros!

 I would like to have a check for a family at the beginning, something like:

  if (name == signature_NSC_ebx)
processor = PROCESSOR_GEODE;
  else if (family == 22)
{
  if (has_movbe)

I get your idea of having the family checked first and then differentiating 
with cpuid info later.
But, this case is getting interesting because, BTVER1 and BTVER2 are two 
variants but don't really have same family numbers.
BTVER1 is family 14h and BTVER2 is family 16h. I don't see near term plans for 
any additional cpus to either 14h or 16h.
Given that fact, this particular check is applicable only for BTVER2.
In that case, having 

  else if (family == 22)
if (has_movbe)
   processor = PROCESSOR_BTVER2;

looks odd. 

Regards
Ganesh


RE: [PATCH, i386] Handle extended family cpuid info for AMD

2014-08-01 Thread Gopalasubramanian, Ganesh
 In this case, having only check for family ID should be enough. If
BTVER1 and BTVER2 can be uniquely determined by their family IDs ,

IMO, this would be the most future-proof approach. Signature checks will 
override family id checks which will override cpuid checks.

Thank you Uros!

I have modified source only for BTVER2. 
The way BTVER1 is currently assigned to processor includes more than one 
family. So, I am leaving that unmoved.

Bootstrap passes.
Is it OK for trunk and backport to open branches.

Regards
-Ganesh


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 706fedc..202bd99 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-08-01  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
+
+   * config/i386/driver-i386.c (host_detect_local_cpu): Handle AMD's 
extended family
+   information. Handle BTVER2 cpu with cpuid family value.  
+
 2014-07-31  James Greenhalgh  james.greenha...@arm.com
 
* config/aarch64/arm_neon.h (vpadd_suf8,16,32,64): Move to
diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c
index 1c6385f..0402c90 100644
--- a/gcc/config/i386/driver-i386.c
+++ b/gcc/config/i386/driver-i386.c
@@ -432,7 +432,8 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
 
   model = (eax  4)  0x0f;
   family = (eax  8)  0x0f;
-  if (vendor == signature_INTEL_ebx)
+  if ((vendor == signature_INTEL_ebx) ||
+  (vendor == signature_AMD_ebx))
 {
   unsigned int extended_model, extended_family;
 
@@ -576,7 +577,7 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
 
   if (name == signature_NSC_ebx)
processor = PROCESSOR_GEODE;
-  else if (has_movbe)
+  else if (family == 22)
processor = PROCESSOR_BTVER2;
   else if (has_avx2)
 processor = PROCESSOR_BDVER4;


[PATCH, i386] Handle extended family cpuid info for AMD

2014-07-31 Thread Gopalasubramanian, Ganesh
Hi,

The below patch handles the AMD's cpuid family information.
With the information from cpuid, BTVER2 cpu for -march=native flag is handled.

Bootstrap passes.
Is it OK for trunk and branches?

Regards
Ganesh 

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6223bd6..3f8bb2c 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-07-31  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
+
+   * tree-sra.c (host_detect_local_cpu): Handle AMD's extended family
+   information. Handle BTVER2 cpu with cpuid family value.  
+
 2014-07-30  Martin Jambor  mjam...@suse.cz
 
* tree-sra.c (sra_ipa_modify_assign): Change type of the first
diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c
index 1c6385f..21ae1f3 100644
--- a/gcc/config/i386/driver-i386.c
+++ b/gcc/config/i386/driver-i386.c
@@ -432,7 +432,8 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
 
   model = (eax  4)  0x0f;
   family = (eax  8)  0x0f;
-  if (vendor == signature_INTEL_ebx)
+  if ((vendor == signature_INTEL_ebx) ||
+  (vendor == signature_AMD_ebx))
 {
   unsigned int extended_model, extended_family;
 
@@ -576,7 +577,7 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
 
   if (name == signature_NSC_ebx)
processor = PROCESSOR_GEODE;
-  else if (has_movbe)
+  else if (has_movbe  family == 22)
processor = PROCESSOR_BTVER2;
   else if (has_avx2)
 processor = PROCESSOR_BDVER4;


RE: [PATCH, i386] Handle extended family cpuid info for AMD

2014-07-31 Thread Gopalasubramanian, Ganesh
 But, looking to processor_alias_table in config/i386/i386.c, only
 PROCESSOR_BTVER2 defines PTA_MOVBE. According to this, the logic is already 
 correct, so the patch is not needed.

We are evaluating bdver4 cpu. Bdver4 also supports MOVBE. I will submit patch 
for bdver4 PTA after our evaluation.

Ganesh.



RE: [PATCH, i386] Handle extended family cpuid info for AMD

2014-07-31 Thread Gopalasubramanian, Ganesh
 Then just use:

 +  else if (has_avx2)
 +processor = PROCESSOR_BDVER4;
   else if (has_movbe)
processor = PROCESSOR_BTVER2;
-  else if (has_avx2)
-processor = PROCESSOR_BDVER4;
  else if (has_xsaveopt)

In that case, with earlier GCC versions where we don’t have bdver4 added, the 
fall back would be BTVER2, 
whereas a BD variant is more desirable.

Ganesh


FW: [PATCH, aarch64] Add prefetch support

2014-07-08 Thread Gopalasubramanian, Ganesh
PING!

-Original Message-
From: Gopalasubramanian, Ganesh 
Sent: Sunday, July 06, 2014 2:12 AM
To: gcc-patches@gcc.gnu.org
Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com
Subject: RE: [PATCH, aarch64] Add prefetch support

PING!

From: Gopalasubramanian, Ganesh
Sent: Friday, July 04, 2014 5:57 AM
To: gcc-patches@gcc.gnu.org
Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com
Subject: [PATCH, aarch64] Add prefetch support

Hi,

Attached is a patch that implements
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

This patch has been already discussed on 
https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html

make -k check passes. Ok for trunk?

Changelog

2014-07-04  Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com

* config/aarch64/aarch64.md (define_insn *prefetch)
(define_insn prefetch): New
* config/aarch64/predicates.md (aarch64_prefetch_pimm)
(aarch64_prefetch_unscaled): New.
* config/arm/types.md (define_attr type): Add prefetch.

Regards
Ganesh


RE: [PATCH, aarch64] Add prefetch support

2014-07-05 Thread Gopalasubramanian, Ganesh
PING!

From: Gopalasubramanian, Ganesh
Sent: Friday, July 04, 2014 5:57 AM
To: gcc-patches@gcc.gnu.org
Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com
Subject: [PATCH, aarch64] Add prefetch support

Hi,

Attached is a patch that implements
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

This patch has been already discussed on 
https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html

make -k check passes. Ok for trunk?

Changelog

2014-07-04  Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com

* config/aarch64/aarch64.md (define_insn *prefetch)
(define_insn prefetch): New
* config/aarch64/predicates.md (aarch64_prefetch_pimm)
(aarch64_prefetch_unscaled): New.
* config/arm/types.md (define_attr type): Add prefetch.

Regards
Ganesh


[PATCH, aarch64] Add prefetch support

2014-07-04 Thread Gopalasubramanian, Ganesh
Hi,

Attached is a patch that implements
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

This patch has been already discussed on 
https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html

make -k check passes. Ok for trunk?

Changelog

2014-07-04  Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com

* config/aarch64/aarch64.md (define_insn *prefetch)
(define_insn prefetch): New
* config/aarch64/predicates.md (aarch64_prefetch_pimm) 
(aarch64_prefetch_unscaled): New.
* config/arm/types.md (define_attr type): Add prefetch.

Regards
Ganesh


prefetch.diff
Description: prefetch.diff


[PATCH, i386] Remove use of vpmacsdql instruction from multiplication.

2014-06-10 Thread Gopalasubramanian, Ganesh
Hi,

The below patch fixes the issue with 64-bit multiplication.
The instruction vpmacsdql does signed 32-bit multiplication.
For V2DImode, we require widened unsigned multiplication.
So, replacing the vpmacsdql instruction with vpmuludq and vpaddq.

This patch had been already discussed in 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52908

With required change in the test xop-imul64-vector.c,  make check passes. Is it 
OK for upstream?

Regards
Ganesh

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index d0a1253..c158612 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2014-06-10  Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com
+
+   * config/i386/i386.c (ix86_expand_sse2_mulvxdi3): Issue instructions
+vpmuludq and vpaddq instead of vpmacsdql for handling 32-bit
+multiplication.
+
 2014-06-07  Jan Hubicka  hubi...@ucw.cz

* cgraphunit.c (assemble_thunks_and_aliases): Expand thunks before
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 9105132..184d82d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -45205,8 +45205,10 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
   /* t4: ((B*E)+(A*F))32, ((D*G)+(C*H))32 */
   emit_insn (gen_ashlv2di3 (t4, t3, GEN_INT (32)));

-  /* op0: (((B*E)+(A*F))32)+(B*F), (((D*G)+(C*H))32)+(D*H) */
-  emit_insn (gen_xop_pmacsdql (op0, op1, op2, t4));
+  /* Multiply lower parts and add all */
+  t5 = gen_reg_rtx (V2DImode);
+  emit_insn (gen_vec_widen_umult_even_v4si (t5, gen_lowpart (V4SImode, 
op1), gen_lowpart (V4SImode, op2)));
+  op0 = expand_binop (mode, add_optab, t5, t4, op0, 1, OPTAB_DIRECT);
 }
   else
 {
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index a6913af..757d3e3 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-06-10 Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
+
+   * gcc.target/i386/xop-imul64-vector.c: Remove the check for
+   vpmacsdql instruction.
+
 2014-06-07  Eric Botcazou  ebotca...@adacore.com

* gnat.dg/opt38.adb: New test.
diff --git a/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c 
b/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c
index fbf605f..fc8c880 100644
--- a/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c
+++ b/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c
@@ -33,4 +33,3 @@ int main ()

 /* { dg-final { scan-assembler vpmulld } } */
 /* { dg-final { scan-assembler vphadddq } } */
-/* { dg-final { scan-assembler vpmacsdql } } */


RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

2014-05-28 Thread Gopalasubramanian, Ganesh
Hi Philipp,

 These changes look good to me.
 We'll try them out on the benchmarks that caused us to add prefetching in the 
 first place.

If you are OK, I would like to get these changes upstreamed.

-Ganesh

-Original Message-
From: Dr. Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com] 
Sent: Friday, February 28, 2014 2:58 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; pins...@gmail.com
Subject: Re: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

Ganesh,

On 28 Feb 2014, at 10:13 , Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 I also have attached a patch that implements the following. 
 * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
 Added a predicate for this.
 * Prefetch with immediate offset - in the range -256 to 255 (Gets 
 generated only when we have a negative offset. Generates prfum instruction). 
 Added a predicate for this.
 * Prefetch with register offset. (modified for printing the locality)

These changes look good to me.
We'll try them out on the benchmarks that caused us to add prefetching in the 
first place.

Best,
Philipp.


RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

2014-02-28 Thread Gopalasubramanian, Ganesh
With the locality value received in the instruction pattern, I think it would 
be safe to handle them in prefetch instruction.
This helps especially AArch64 has prefetch instructions that can handle this 
locality.

+(define_insn prefetch
+  [(prefetch (match_operand:DI 0 address_operand r)
+(match_operand:QI 1 const_int_operand n)
+(match_operand:QI 2 const_int_operand n))]
+  
+  *
+{
+  int locality = INTVAL (operands[2]);
+
+  gcc_assert (IN_RANGE (locality, 0, 3));
+
+  if (locality == 0)
+ /* non temporal locality */
+ return (INTVAL(operands[1])) ? \prfm\\tPSTL1STRM, [%0, #0]\ : 
\prfm\\tPLDL1STRM, [%0, #0]\;
+
+  /* temporal locality */
+  return (INTVAL(operands[1])) ? \prfm\\tPSTL%2KEEP, [%0, #0]\ : 
\prfm\\tPLDL%2KEEP, [%0, #0]\;
+}
+  [(set_attr type prefetch)]
+)
+

I also have attached a patch that implements
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

Regards
Ganesh

-Original Message-
From: Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com] 
Sent: Wednesday, February 19, 2014 2:40 AM
To: gcc-patches@gcc.gnu.org
Cc: philipp.toms...@theobroma-systems.com
Subject: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

---
 gcc/config/aarch64/aarch64.md | 17 +
 gcc/config/arm/types.md   |  2 ++
 2 files changed, 19 insertions(+)

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md 
index 99a6ac8..b972a1b 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -293,6 +293,23 @@
   [(set_attr type no_insn)]
 )
 
+(define_insn prefetch
+  [(prefetch (match_operand:DI 0 register_operand r)
+(match_operand:QI 1 const_int_operand n)
+(match_operand:QI 2 const_int_operand n))]
+  
+  *
+{
+  if (INTVAL(operands[2]) == 0)
+ /* no temporal locality */
+ return (INTVAL(operands[1])) ? \prfm\\tPSTL1STRM, [%0, #0]\ : 
+\prfm\\tPLDL1STRM, [%0, #0]\;
+
+  /* temporal locality */
+  return (INTVAL(operands[1])) ? \prfm\\tPSTL1KEEP, [%0, #0]\ : 
+\prfm\\tPLDL1KEEP, [%0, #0]\; }
+  [(set_attr type prefetch)]
+)
+
 (define_insn trap
   [(trap_if (const_int 1) (const_int 8))]
   
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index 
cc39cd1..1d1280d 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -117,6 +117,7 @@
 ; mvn_shift_reg  inverting move instruction, shifted operand by a register.
 ; no_insnan insn which does not represent an instruction in the
 ;final output, thus having no impact on scheduling.
+; prefetch  a prefetch instruction
 ; rbit   reverse bits.
 ; revreverse bytes.
 ; sdiv   signed division.
@@ -553,6 +554,7 @@
   call,\
   clz,\
   no_insn,\
+  prefetch,\
   csel,\
   crc,\
   extend,\
--
1.9.0



prefetchdiff.log
Description: prefetchdiff.log


RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

2014-02-28 Thread Gopalasubramanian, Ganesh
Avoided top-posting and resending.

+  /* temporal locality */
+  return (INTVAL(operands[1])) ? \prfm\\tPSTL1KEEP, [%0, #0]\ : 
+\prfm\\tPLDL1KEEP, [%0, #0]\; }
+  [(set_attr type prefetch)]
+)
+

With the locality value received in the instruction pattern, I think it would 
be safe to handle them in prefetch instruction.
This helps especially AArch64 has prefetch instructions that can handle this 
locality.

+(define_insn prefetch
+  [(prefetch (match_operand:DI 0 address_operand r)
+(match_operand:QI 1 const_int_operand n)
+(match_operand:QI 2 const_int_operand n))]
+  
+  *
+{
+  int locality = INTVAL (operands[2]);
+
+  gcc_assert (IN_RANGE (locality, 0, 3));
+
+  if (locality == 0)
+ /* non temporal locality */
+ return (INTVAL(operands[1])) ? \prfm\\tPSTL1STRM, [%0, #0]\ : 
\prfm\\tPLDL1STRM, [%0, #0]\;
+
+  /* temporal locality */
+  return (INTVAL(operands[1])) ? \prfm\\tPSTL%2KEEP, [%0, #0]\ : 
\prfm\\tPLDL%2KEEP, [%0, #0]\;
+}
+  [(set_attr type prefetch)]
+)
+

I also have attached a patch that implements the following. 
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

Regards
Ganesh


prefetchdiff.log
Description: prefetchdiff.log


FW: Non-temporal move

2014-02-24 Thread Gopalasubramanian, Ganesh
I could see storent pattern in x86 machine descriptions (in sse.md)., but 
internals doc don't mention it. Should we add a description about this in the 
internals doc?

Regards
Ganesh



RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-26 Thread Gopalasubramanian, Ganesh
 I'm sorry I didn't notice previous conversation. Please install ASAP.

Thanks Uros! Committed to revision 206210.
- Ganesh



RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-25 Thread Gopalasubramanian, Ganesh
Hi,

 (get_amd_cpu): Handle AMD_BOBCAT, AMD_JAGUAR, AMDFAM15H_BDVER2 and
 AMDFAM15H_BDVER3.

As mentioned earlier, we would like to stick with BTVER1 and BTVER2 instead of 
using BOBCAT or JAGUAR.
Attached patch does the changes.

Regards
Ganesh


NameChange.patch
Description: NameChange.patch


RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-12-19 Thread Gopalasubramanian, Ganesh
 Please provide updated ChangeLog.

--- gcc/ChangeLog   (revision 206106)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,14 @@
+2013-12-19  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
+
+   * config/i386/i386.c: Include cfgloop.h. 
+   (ix86_loop_memcount): New function.
+   (ix86_loop_unroll_adjust): New function.
+   (TARGET_LOOP_UNROLL_ADJUST): Define.
+   * config/i386/i386.h
+   (TARGET_ADJUST_UNROLL): Define.
+   * config/i386/x86-tune.def
+   (X86_TUNE_ADJUST_UNROLL): Define.
+

 The function comment is missing. Maybe you should also describe magic number 
 32 here?
Added the function comment.

 Otherwise, the patch looks OK.
Thanks.  Bootstrapping passes. Is it OK for upstream?

 BTW: Please avoid top-posting, see e.g. [1] for reasons...
Sorry for the lapse. Will comply.

Regards
Ganesh



unroll-adjust.patch
Description: unroll-adjust.patch


RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-19 Thread Gopalasubramanian, Ganesh
 Sorry, I must have been looking at an older version, but as I said I already 
 did enable it in the latest patch. (see 
 http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01577.html )

Sorry for causing another revision but we would like to stick with btver1 and 
btver2 rather than BOBCAT or JAGUAR.
Therefore the changes would be like

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 206065)
+++ gcc/config/i386/i386.c  (working copy)
@@ -29965,9 +29965,14 @@
 P_PROC_SSE4_2,
 P_POPCNT,
 P_AVX,
+P_PROC_AVX,
+P_FMA4,
+P_XOP,
+P_PROC_XOP,
+P_FMA,
+P_PROC_FMA,
 P_AVX2,
-P_FMA,
-P_PROC_FMA
+P_PROC_AVX2
   };
 
  enum feature_priority priority = P_ZERO;
@@ -29986,11 +29991,15 @@
   {sse, P_SSE},
   {sse2, P_SSE2},
   {sse3, P_SSE3},
+  {sse4a, P_SSE4_a},
   {ssse3, P_SSSE3},
   {sse4.1, P_SSE4_1},
   {sse4.2, P_SSE4_2},
   {popcnt, P_POPCNT},
   {avx, P_AVX},
+  {fma4, P_FMA4},
+  {xop, P_XOP},
+  {fma, P_FMA},
   {avx2, P_AVX2}
 };
 
@@ -30044,25 +30053,49 @@
  break;
 case PROCESSOR_COREI7_AVX:
   arg_str = corei7-avx;
-  priority = P_PROC_SSE4_2;
+  priority = P_PROC_AVX;
   break;
+case PROCESSOR_HASWELL:
+  arg_str = core-avx2;
+  priority = P_PROC_AVX2;
+  break;
case PROCESSOR_ATOM:
  arg_str = atom;
  priority = P_PROC_SSSE3;
  break;
+case PROCESSOR_SLM:
+  arg_str = slm;
+  priority = P_PROC_SSE4_2;
+  break;
case PROCESSOR_AMDFAM10:
  arg_str = amdfam10h;
  priority = P_PROC_SSE4_a;
  break;
+case PROCESSOR_BTVER1:
+  arg_str = btver1;
+  priority = P_PROC_SSE4_a;
+  break;
+case PROCESSOR_BTVER2:
+  arg_str = btver2;
+  priority = P_PROC_AVX;
+  break;
case PROCESSOR_BDVER1:
  arg_str = bdver1;
- priority = P_PROC_FMA;
+ priority = P_PROC_XOP;
  break;
case PROCESSOR_BDVER2:
  arg_str = bdver2;
  priority = P_PROC_FMA;
  break;
-   }  
+case PROCESSOR_BDVER3:
+  arg_str = bdver3;
+  priority = P_PROC_FMA;
+  break;
+case PROCESSOR_BDVER4:
+  arg_str = bdver4;
+  priority = P_PROC_AVX2;
+  break;
+}  
}
 
   cl_target_option_restore (global_options, cur_target);
@@ -30922,9 +30955,13 @@
 F_SSE2,
 F_SSE3,
 F_SSSE3,
+F_SSE4_a,
 F_SSE4_1,
 F_SSE4_2,
 F_AVX,
+F_FMA4,
+F_XOP,
+F_FMA,
 F_AVX2,
 F_MAX
   };
@@ -30943,6 +30980,10 @@
 M_AMDFAM10H,
 M_AMDFAM15H,
 M_INTEL_SLM,
+M_INTEL_COREI7_AVX,
+M_INTEL_CORE_AVX2,
+M_AMD_BTVER1,
+M_AMD_BTVER2,
 M_CPU_SUBTYPE_START,
 M_INTEL_COREI7_NEHALEM,
 M_INTEL_COREI7_WESTMERE,
@@ -30953,7 +30994,9 @@
 M_AMDFAM15H_BDVER1,
 M_AMDFAM15H_BDVER2,
 M_AMDFAM15H_BDVER3,
-M_AMDFAM15H_BDVER4
+M_AMDFAM15H_BDVER4,
+M_INTEL_COREI7_IVYBRIDGE,
+M_INTEL_CORE_HASWELL
   };
 
   static struct _arch_names_table
@@ -30971,11 +31014,17 @@
   {corei7, M_INTEL_COREI7},
   {nehalem, M_INTEL_COREI7_NEHALEM},
   {westmere, M_INTEL_COREI7_WESTMERE},
+  {corei7-avx, M_INTEL_COREI7_AVX},
   {sandybridge, M_INTEL_COREI7_SANDYBRIDGE},
+  {ivybridge, M_INTEL_COREI7_IVYBRIDGE},
+  {core-avx2, M_INTEL_CORE_AVX2},
+  {haswell, M_INTEL_CORE_HASWELL},
   {amdfam10h, M_AMDFAM10H},
   {barcelona, M_AMDFAM10H_BARCELONA},
   {shanghai, M_AMDFAM10H_SHANGHAI},
   {istanbul, M_AMDFAM10H_ISTANBUL},
+  {btver1, M_AMD_BTVER1},
+  {btver2, M_AMD_BTVER2},
   {amdfam15h, M_AMDFAM15H},
   {bdver1, M_AMDFAM15H_BDVER1},
   {bdver2, M_AMDFAM15H_BDVER2},
@@ -30997,9 +31046,13 @@
   {sse2,   F_SSE2},
   {sse3,   F_SSE3},
   {ssse3,  F_SSSE3},
+  {sse4a,  F_SSE4_a},
   {sse4.1, F_SSE4_1},
   {sse4.2, F_SSE4_2},
   {avx,F_AVX},
+  {fma4,   F_FMA4},
+  {xop,F_XOP},
+  {fma,F_FMA},
   {avx2,   F_AVX2}
 };

Index: libgcc/config/i386/cpuinfo.c
===
--- libgcc/config/i386/cpuinfo.c(revision 206065)
+++ libgcc/config/i386/cpuinfo.c(working copy)
@@ -62,6 +62,10 @@
   AMDFAM10H,
   AMDFAM15H,
   INTEL_SLM,
+  INTEL_COREI7_AVX,
+  INTEL_CORE_AVX2,
+  AMD_BTVER1,
+  AMD_BTVER2,
   CPU_TYPE_MAX
 };
 
@@ -75,6 +79,10 @@
   AMDFAM10H_ISTANBUL,
   AMDFAM15H_BDVER1,
   AMDFAM15H_BDVER2,
+  AMDFAM15H_BDVER3,
+  AMDFAM15H_BDVER4,
+  

RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-18 Thread Gopalasubramanian, Ganesh

Ping!

Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote:


 Yes, I figured that was the original idea behind it, but the final family of 
 the jaguar processors seems to have become 16h instead of 14h (bobcat) at 
 some point.

Yes. It is amdfam16h. I was supposed to pass on some comments on the patch.
1. Amdfam16h for Jaguar.
2. For Jaguar, the priority needs to be AVX (AVX got included into the Jaguar 
ISA).

I have a doubt! What would be done if priority is set to F_FMA4 instead of 
F_XOP for bdver1?

Regards
Ganesh




RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-18 Thread Gopalasubramanian, Ganesh
 Yes, I changed that in the last patch, though I consider it momentarily 
 problematic because you do not yet enable AVX with march=btver2 (AVX versions 
 would currently be better than btver2 versions for a btver2 arch), but expect
march=btver2 will be fixed soon.

The  processor_alias_table entry for btver2 in i386.c enables AVX.

snip
  {btver2, PROCESSOR_BTVER2, CPU_BTVER2,
PTA_64BIT | PTA_MMX |  PTA_SSE  | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_SSE4_1
| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
| PTA_BMI | PTA_F16C | PTA_MOVBE | PTA_PRFCHW
| PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT},
/snip

The assembly listing for a simple test (compiled with -march=btver2) also has 
-mavx enabled. So, can you please enable AVX for btver2?

Regards
Ganesh



RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-16 Thread Gopalasubramanian, Ganesh
 Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that 
 an old term for what has become the Jaguar architecture?

Yes, btver2 = jaguar. We have the name as per its family name (i.e, bobcat 
family) in GCC. 
Similarly we have the names bdver2 = piledriver, bdver3 = steamroller 
as per their family (bulldozer) name.

Regards
Ganesh

-Original Message-
From: Allan Sandfeld Jensen [mailto:carew...@gmail.com] 
Sent: Monday, December 16, 2013 12:25 AM
To: Uros Bizjak
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [Patch, i386] PR 59422 - Support more targets for function multi 
versioning

Hi again
On Wednesday 11 December 2013, Uros Bizjak wrote:
 Hello!
 
  PR gcc/59422
  
  This patch extends the supported targets for function multi versiong 
  to also include Haswell, Silvermont, and the most recent AMD models. 
  It also prioritizes AVX2 versions over AMD specific pre-AVX2 versions.
 
 Please add a ChangeLog entry and attach the complete patch. Please 
 also state how you tested the patch, as outlined in the instructions 
 [1].
 
 [1] http://gcc.gnu.org/contribute.html
 
Updated patch for better CPU model detection and added ChangeLog.

The patch has been tested with the attached test.cpp. Verified that it doesn't 
build before the patch, and that it builds after, and verified it selects 
correct versions at runtime based on either CPU model or supported ISA (tested 
on 3 machines: SandyBridge, IvyBridge and Phenom II).

Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that an 
old term for what has become the Jaguar architecture?

`Allan



RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-16 Thread Gopalasubramanian, Ganesh
 Yes, I figured that was the original idea behind it, but the final family of 
 the jaguar processors seems to have become 16h instead of 14h (bobcat) at 
 some point.

Yes. It is amdfam16h. I was supposed to pass on some comments on the patch.
1. Amdfam16h for Jaguar.
2. For Jaguar, the priority needs to be AVX (AVX got included into the Jaguar 
ISA).

I have a doubt! What would be done if priority is set to F_FMA4 instead of 
F_XOP for bdver1?

Regards
Ganesh




RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-12-11 Thread Gopalasubramanian, Ganesh
Hi Uros!

Accommodated the changes that you mentioned.
Completed the bootstrap testing too.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Wednesday, December 04, 2013 3:17 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Richard Guenther richard.guent...@gmail.com 
(richard.guent...@gmail.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Dec 4, 2013 at 9:39 AM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 Attached is the revised patch.
 The target independent part has been already approved and added.

 This revision of the patch adds a x86 tune definition and checks it while 
 deciding the unroll factor.

 Accommodated the comments given by you except one.

 *x will never be null for active insns.
 Since every rtx in the insn is checked for memory references, the NULL_RTX 
 check is required.

Yes you are correct. for_each_rtx also passes NULL_RTX, I was distracted by 
There are no sub-expressions. comment.

+if (NONDEBUG_INSN_P (insn)  INSN_CODE (insn) != -1)

Do you need to check for INSN_CODE here? IIRC, checking for NONDEBUG_INSN_P is 
enough.

+for_each_rtx (insn, (rtx_function) ix86_loop_memcount,
mem_count);
+}
+  free (bbs);
+
+  if (mem_count =32)
+return 32/mem_count;

Ouch... mem_count can be zero. Is there a reason to change this part from 
previous patch?

Uros.



unroll-adjust.patch
Description: unroll-adjust.patch


[patch][wwwdocs] gcc 4.9 changes - AMD new cores

2013-12-05 Thread Gopalasubramanian, Ganesh
Hello, 

This patch adds details about new AMD cores that got enabled in GCC-4.9.
OK for the wwwdocs? 

Regards
Ganesh

cvs diff: Diffing .
Index: changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v
retrieving revision 1.44
diff -r1.44 changes.html
404a405,407
 liSupport for new AMD family 15h processors (Excavator core)
 is now available through the code-march=bdver4/code and
 code-mtune=bdver4/code options./li





RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-12-04 Thread Gopalasubramanian, Ganesh
Hi Uros!

Attached is the revised patch.
The target independent part has been already approved and added.

This revision of the patch adds a x86 tune definition and checks it while 
deciding the unroll factor.

Accommodated the comments given by you except one.

 *x will never be null for active insns.
Since every rtx in the insn is checked for memory references, the NULL_RTX 
check is required.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Friday, November 22, 2013 1:46 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Richard Guenther richard.guent...@gmail.com 
(richard.guent...@gmail.com); borntrae...@de.ibm.com; H.J. Lu 
(hjl.to...@gmail.com); Jakub Jelinek (ja...@redhat.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Nov 20, 2013 at 7:26 PM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 Steamroller processors contain a loop predictor and a loop buffer, which may 
 make unrolling small loops less important.
 When unrolling small loops for steamroller, making the unrolled loop fit in 
 the loop buffer should be a priority.

 This patch uses a heuristic approach (number of memory references) to decide 
 the unrolling factor for small loops.
 This patch has some noise in SPEC 2006 results.

 Bootstrapping passes.

 I would like to know your comments before committing.

Please split the patch to target-dependant and target-independant part, and get 
target-idependant part reviewed first.

This part:

+  if (ix86_tune != PROCESSOR_BDVER3  ix86_tune != PROCESSOR_BDVER4)  
+ {
+return nunroll;
+  }

is wrong. You should introduce tune variable (as H.J. suggested) and check that 
variable here. Target dependant tuning options should be in x86-tune.def, so 
everything regarding tuning can be found in one place.

+if (INSN_P (insn)  INSN_CODE (insn) != -1)
+for_each_rtx (insn, (rtx_function) ix86_loop_memcount,
mem_count);

if (NONDEBUG_INSN_P (insn))
  for_each_rtx (PATTERN(insn), ...);

otherwise your heuristics will depend on -g compile option.

+  if ( (mem_count*nunroll) = 32)

Extra parenthesis.

+static int
+ix86_loop_memcount (rtx *x, unsigned *mem_count) {
+  if (*x != NULL_RTX  MEM_P (*x))

*x will never be null for active insns.

Uros.



unroll-adjust.patch
Description: unroll-adjust.patch


RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-12-04 Thread Gopalasubramanian, Ganesh
 Ouch... mem_count can be zero. Is there a reason to change this part from 
 previous patch?

Oops!  You're right. I will correct this.

The idea is to count the memory references and decide on the unrolling factor.
Previous patch does that in two steps I thought of doing that in a single step. 
(I think I missed my step here ;) )

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Wednesday, December 04, 2013 3:17 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Richard Guenther richard.guent...@gmail.com 
(richard.guent...@gmail.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Dec 4, 2013 at 9:39 AM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 Attached is the revised patch.
 The target independent part has been already approved and added.

 This revision of the patch adds a x86 tune definition and checks it while 
 deciding the unroll factor.

 Accommodated the comments given by you except one.

 *x will never be null for active insns.
 Since every rtx in the insn is checked for memory references, the NULL_RTX 
 check is required.

Yes you are correct. for_each_rtx also passes NULL_RTX, I was distracted by 
There are no sub-expressions. comment.

+if (NONDEBUG_INSN_P (insn)  INSN_CODE (insn) != -1)

Do you need to check for INSN_CODE here? IIRC, checking for NONDEBUG_INSN_P is 
enough.

+for_each_rtx (insn, (rtx_function) ix86_loop_memcount,
mem_count);
+}
+  free (bbs);
+
+  if (mem_count =32)
+return 32/mem_count;

Ouch... mem_count can be zero. Is there a reason to change this part from 
previous patch?

Uros.




RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-11-28 Thread Gopalasubramanian, Ganesh
This patch adds influence of macro TARGET_LOOP_UNROLL_ADJUST during constant 
iterations (decide_unroll_constant_iterations).
The macro has been already checked for runtime iterations 
(decide_unroll_runtime_iterations),  and for unroll stupid 
(decide_unroll_stupid).

Bootstrapping and test passes.

Would like to know your comments before committing.

Regards
Ganesh

2013-11-28  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
 
* loop-unroll.c (decide_unroll_constant_iterations): Check macro 
TARGET_LOOP_UNROLL_ADJUST while deciding unroll factor.


diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c
index 9c87167..557915f 100644
--- a/gcc/loop-unroll.c
+++ b/gcc/loop-unroll.c
@@ -664,6 +664,9 @@ decide_unroll_constant_iterations (struct loop *loop, int 
flags)
   if (nunroll  (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
 nunroll = PARAM_VALUE (PARAM_MAX_UNROLL_TIMES);

+  if (targetm.loop_unroll_adjust)
+nunroll = targetm.loop_unroll_adjust (nunroll, loop);
+
   /* Skip big loops.  */
   if (nunroll = 1)
 {

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Friday, November 22, 2013 1:46 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Richard Guenther richard.guent...@gmail.com 
(richard.guent...@gmail.com); borntrae...@de.ibm.com; H.J. Lu 
(hjl.to...@gmail.com); Jakub Jelinek (ja...@redhat.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Nov 20, 2013 at 7:26 PM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 Steamroller processors contain a loop predictor and a loop buffer, which may 
 make unrolling small loops less important.
 When unrolling small loops for steamroller, making the unrolled loop fit in 
 the loop buffer should be a priority.

 This patch uses a heuristic approach (number of memory references) to decide 
 the unrolling factor for small loops.
 This patch has some noise in SPEC 2006 results.

 Bootstrapping passes.

 I would like to know your comments before committing.

Please split the patch to target-dependant and target-independant part, and get 
target-idependant part reviewed first.



RE: [PATCH, i386]: Fix PR56788, _mm_frcz_sd and _mm_frcz_ss ignore their second argument

2013-11-26 Thread Gopalasubramanian, Ganesh
 Hopefully someone from AMD will provide tests that are mysteriously missing 
 from XOP testsuite.

As pointed out by Marc, I added myself to the bug later. 
I was bit confused about the internal insn representation with user-visible 
function.
So, couldn't add test then and there. I could have solved that earlier. Sorry 
for that.

Attached is the test that checks the (controversial) frcz functions.

Uros could you please add this to your patch while committing.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Saturday, November 23, 2013 6:49 PM
To: gcc-patches@gcc.gnu.org
Cc: Cong Hou; Marc Glisse; Gopalasubramanian, Ganesh
Subject: [PATCH, i386]: Fix PR56788, _mm_frcz_sd and _mm_frcz_ss ignore their 
second argument

Hello!

Attached patch fixes PR56788, where _mm_frcz_{ss,sd} intrinsics ignored their 
second argument.

As explained in the PR [1], gcc implements two-operand vector-merge
form as documented in Microsoft's definition [2]. However, in contrast to other 
SSE scalar insns, the instruction itself clears upper bits to zero.

There were a couple of problems: the builtin was declared as builtin with two 
input operands, but the number of input operands didn't correspond to referred 
insn pattern, leaving its second operand uninitialized. The intrinsic was also 
implemented without necessary movss/movsd fixup that would merge both its 
operands in a correct way.

Please also note that the definition in clang is wrong.

I didn't include any testcase in the patch, since I don't have access to XOP 
target. Hopefully someone from AMD will provide tests that are mysteriously 
missing from XOP testsuite.

2013-11-23  Uros Bizjak  ubiz...@gmail.com

PR target/56788
* config/i386/i386.c (bdesc_multi_arg) IX86_BUILTIN_VFRCZSS:
Declare as MULTI_ARG_1_SF instruction.
IX86_BUILTIN_VFRCZSD: Decleare as MULTI_ARG_1_DF instruction.
* config/i386/sse.md (*xop_vmfrczmode2): Rename
from *xop_vmfrcz_mode.
* config/i386/xopintrin.h (_mm_frcz_ss): Use __builtin_ia32_movss
to merge scalar result with __A.
(_mm_frcz_sd): Use __builtin_ia32_movsd to merge scalar
result with __A.

Patch was bootstrapped and regression tested on x86_64-pc-linux-gnu {,-m32}.

The patch was committed to mainline SVN and will be committed to other release 
branches in a couple of days (hopefully with additional tests).

[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56788
[2] http://msdn.microsoft.com/en-us/library/vstudio/gg445126%28v=vs.100%29.aspx

Uros.
#include x86intrin.h
#include m128-check.h

void
check_mm_vmfrcz_sd (__m128d __A, __m128d __B)
{
  union128d a, b, c;
  double d[2];

  a.x = __A;
  b.x = __B;
  c.x = _mm_frcz_sd (__A, __B);
  d[0] = b.a[0] - (int)b.a[0] ;
  d[1] = a.a[1];
  if (check_union128d (c, d))
abort ();
}

void
check_mm_vmfrcz_ss (__m128 __A, __m128 __B)
{
  union128 a, b, c;
  float f[4];

  a.x = __A;
  b.x = __B;
  c.x = _mm_frcz_ss (__A, __B);
  f[0] = b.a[0] - (int)b.a[0] ;
  f[1] = a.a[1];
  f[2] = a.a[2];
  f[3] = a.a[3];
  if (check_union128 (c, f))
abort ();
}

void
main (void)
{
  union128 a, b;
  union128d c,d;
  int i;

  for (i = 0; i  4; i++)
{
   a.a[i] = i + 3.5;
   b.a[i] = i + 7.9;
}
  for (i = 0; i  2; i++)
{
   c.a[i] = i + 3.5;
   d.a[i] = i + 7.987654321;
}
  check_mm_vmfrcz_ss (a.x, b.x);
  check_mm_vmfrcz_sd (c.x, d.x);
}

RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-11-21 Thread Gopalasubramanian, Ganesh
Ping!

-Original Message-
From: Gopalasubramanian, Ganesh 
Sent: Thursday, November 21, 2013 10:35 AM
To: 'H.J. Lu'
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther 
richard.guent...@gmail.com (richard.guent...@gmail.com); 
borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com)
Subject: RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

 I suggest you add this to x86-tune.def and enable it for
 bdver3 and bdver4.

The macro TARGET_LOOP_UNROLL_ADJUST is not new. It is already available and is 
used by target s390.
Since it is not an x86 only feature I didn't add that in x86-tune.def.

Regards
Ganesh

-Original Message-
From: H.J. Lu [mailto:hjl.to...@gmail.com] 
Sent: Thursday, November 21, 2013 12:02 AM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther 
richard.guent...@gmail.com (richard.guent...@gmail.com); 
borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Nov 20, 2013 at 10:26 AM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:
 Hi,

 Steamroller processors contain a loop predictor and a loop buffer, which may 
 make unrolling small loops less important.
 When unrolling small loops for steamroller, making the unrolled loop fit in 
 the loop buffer should be a priority.

 This patch uses a heuristic approach (number of memory references) to decide 
 the unrolling factor for small loops.
 This patch has some noise in SPEC 2006 results.

 Bootstrapping passes.

 I would like to know your comments before committing.


I suggest you add this to x86-tune.def and enable it for
bdver3 and bdver4.



--
H.J.




[RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-11-20 Thread Gopalasubramanian, Ganesh
Hi,

Steamroller processors contain a loop predictor and a loop buffer, which may 
make unrolling small loops less important.
When unrolling small loops for steamroller, making the unrolled loop fit in the 
loop buffer should be a priority.

This patch uses a heuristic approach (number of memory references) to decide 
the unrolling factor for small loops.
This patch has some noise in SPEC 2006 results. 

Bootstrapping passes.

I would like to know your comments before committing.

Regards
Ganesh


loop_unroll_bdver3.patch
Description: loop_unroll_bdver3.patch


RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-11-20 Thread Gopalasubramanian, Ganesh
 I suggest you add this to x86-tune.def and enable it for
 bdver3 and bdver4.

The macro TARGET_LOOP_UNROLL_ADJUST is not new. It is already available and is 
used by target s390.
Since it is not an x86 only feature I didn't add that in x86-tune.def.

Regards
Ganesh

-Original Message-
From: H.J. Lu [mailto:hjl.to...@gmail.com] 
Sent: Thursday, November 21, 2013 12:02 AM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther 
richard.guent...@gmail.com (richard.guent...@gmail.com); 
borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Nov 20, 2013 at 10:26 AM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:
 Hi,

 Steamroller processors contain a loop predictor and a loop buffer, which may 
 make unrolling small loops less important.
 When unrolling small loops for steamroller, making the unrolled loop fit in 
 the loop buffer should be a priority.

 This patch uses a heuristic approach (number of memory references) to decide 
 the unrolling factor for small loops.
 This patch has some noise in SPEC 2006 results.

 Bootstrapping passes.

 I would like to know your comments before committing.


I suggest you add this to x86-tune.def and enable it for
bdver3 and bdver4.



--
H.J.




RE: Honnor ix86_accumulate_outgoing_args again

2013-11-12 Thread Gopalasubramanian, Ganesh
 we are going to have some AMD CPU with AVX2 support soon, the question is
 if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even
 if it will prefer split, the question is if like bdver{1,2,3} it will
 be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned
 loads/stores are handled is much less important there.  Ganesh?

256-bit is friendly on bdver4. 
But, 256 bit unaligned stores are micro-coded which we would like to avoid. So 
we require 128-bit MOVUPS.

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Tuesday, November 12, 2013 3:57 PM
To: Jan Hubicka
Cc: H.J. Lu; Vladimir Makarov; GCC Patches; Uros Bizjak; Richard Henderson; 
Gopalasubramanian, Ganesh
Subject: Re: Honnor ix86_accumulate_outgoing_args again

On Tue, Nov 12, 2013 at 11:05:45AM +0100, Jan Hubicka wrote:
  @@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx 
  op0, rtx op1)
  
 if (MEM_P (op1))
   {
  -  if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
  +  if (!TARGET_AVX2  TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
   {
 rtx r = gen_reg_rtx (mode);
 m = adjust_address (op1, mode, 0); @@ -16596,7 +16596,7 @@ 
  ix86_avx256_split_vector_move_misalign (rtx op0, rtx op1)
   }
 else if (MEM_P (op0))
   {
  -  if (TARGET_AVX256_SPLIT_UNALIGNED_STORE)
  +  if (!TARGET_AVX2  TARGET_AVX256_SPLIT_UNALIGNED_STORE)
 
 I would add explanation comment on those two.

Looking at http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01235.html
we are going to have some AMD CPU with AVX2 support soon, the question is
if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even
if it will prefer split, the question is if like bdver{1,2,3} it will
be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned
loads/stores are handled is much less important there.  Ganesh?

 Shall we also disable argument accumulation for cores? It seems we won't
 solve the IRA issues, right?

You mean LRA issues here, right?  If you are starting to use
no-accumulate-outgoing-args much more often than in the past, I think
the problem that LRA forces a frame pointer in that case is much more
important now (or has that been fixed in the mean time?).  Vlad?

Jakub




RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-11-05 Thread Gopalasubramanian, Ganesh
Thanks Honza!

I have committed changes ( for default ).
http://gcc.gnu.org/viewcvs/gcc?view=revisionrevision=204442

I will add lookahead value 8 for O3 after experimenting with it.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Wednesday, October 30, 2013 1:54 AM
To: Richard Biener
Cc: Jan Hubicka; Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org; Uros 
Bizjak (ubiz...@gmail.com); H.J. Lu (hjl.to...@gmail.com)
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

 On Fri, 25 Oct 2013, Jan Hubicka wrote:
 
OK, so it is about 2%.  Did you try if you need lookahead even in the 
early pass (before reload)?  My guess would be so, but if not, it could 
cut the cost to half.  For -Ofast/-O3 it looks resonable to me, but we 
will  need to announce it on the ML.  For other settings I think we 
need to work on more improvements or cut the expenses.
   
   Yes, it is required before reload.  
   
   I have another idea which can be pondered upon. Currently, can we enable 
   lookahead with the value 4 (pre reload) for default? This will 
   exponentially cut the cost of build time. 
   I have done some measurements on the build time of some benchmarks 
   (mentioned below) with lookahead value 4. The 2% increase in build time 
   with value 8 is now almost gone.
   
  dfa4   no_lookahead

perlbench   - 191s  193s
bzip2   - 19s   19s
gcc - 429s  429s
mcf - 3s3s
gobmk   - 116s  115s
hmmer   - 60s   60s
sjeng   - 18s   17s
libquantum  - 6s6s
h264ref - 107s  107s
omnetpp - 128s  128s
astar   - 7s7s
bwaves  - 5s5s
gamess  - 1964s 1957s
milc- 18s   18s
GemsFDTD- 273s  272s
   
   Lookahead value 4 also helps because, the modified decoder model in 
   bdver3.md is only two cycles deep (though in hardware it is actually 4 
   cycles deep). This means that we can look another two levels deep for 
   better schedule.
   GemsFDTD still retains the performance boost of around 6-7% with value 4.
   
   Let me know your thoughts.
  
  This seems resonable.  I would go for lookahead of 4 for now and 8 
  for -Ofast and we can tune things based on the experience with this setting 
  incrementally.
  Uros, Richard, what do you think?
 
 Well, certainly -O3 not -Ofast.

Yes, enabling 4 by default and 8 at -O3 seems fine to me.

Honza
 
 Richard.




RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-25 Thread Gopalasubramanian, Ganesh
 OK, so it is about 2%.  Did you try if you need lookahead even in the early 
 pass (before reload)?  My guess would be so, but if not, it could cut the 
 cost to half.  For -Ofast/-O3 it looks resonable to me, but we will  need to 
 announce it on the ML.  For other settings I think we need to work on more 
 improvements or cut the expenses.

Yes, it is required before reload.  

I have another idea which can be pondered upon. Currently, can we enable 
lookahead with the value 4 (pre reload) for default? This will exponentially 
cut the cost of build time. 
I have done some measurements on the build time of some benchmarks (mentioned 
below) with lookahead value 4. The 2% increase in build time with value 8 is 
now almost gone.

   dfa4   no_lookahead
 
 perlbench   - 191s  193s
 bzip2   - 19s   19s
 gcc - 429s  429s
 mcf - 3s3s
 gobmk   - 116s  115s
 hmmer   - 60s   60s
 sjeng   - 18s   17s
 libquantum  - 6s6s
 h264ref - 107s  107s
 omnetpp - 128s  128s
 astar   - 7s7s
 bwaves  - 5s5s
 gamess  - 1964s 1957s
 milc- 18s   18s
 GemsFDTD- 273s  272s

Lookahead value 4 also helps because, the modified decoder model in bdver3.md 
is only two cycles deep (though in hardware it is actually 4 cycles deep). This 
means that we can look another two levels deep for better schedule.
GemsFDTD still retains the performance boost of around 6-7% with value 4.

Let me know your thoughts.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Thursday, October 24, 2013 6:48 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); H.J. 
Lu (hjl.to...@gmail.com)
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

 Hi,
 
  Is this with -fschedule-insns? Or only with default settings?  Did you test 
  the compile time implications of increasing the lookahead? (value of 8 is 
  very large, we may consider enbling it only for -Ofast, limiting for 
  postreload only or something similar).
 
 The improvement is seen with the options -fschedule-insns  -fschedule-insns2 
 -fsched-pressure
 
 Below are the build times of some of the SPEC benchmarks
 
   dfa8   no_lookahead
 
 perlbench   - 196s  193s
 bzip2   - 19s   19s
 gcc - 439s  429s
 mcf - 3s3s
 gobmk   - 119s  115s
 hmmer   - 62s   60s
 sjeng   - 18s   17s
 libquantum  - 6s6s
 h264ref - 110s  107s
 omnetpp - 132s  128s
 astar   - 7s7s
 bwaves  - 4s5s
 gamess  - 1996s 1957s
 milc- 18s   18s
 GemsFDTD- 276s  272s
 
 I think we can enable it by default rather than for -Ofast.
 Please let me know your inputs.

OK, so it is about 2%.  Did you try if you need lookahead even in the early 
pass (before reload)?  My guess would be so, but if not, it could cut the cost 
to half.  For -Ofast/-O3 it looks resonable to me, but we will need to announce 
it on the ML.  For other settings I think we need to work on more improvmeents 
or cut the expenses.

Honza
 
 Regards
 Ganesh
 
 -Original Message-
 From: Jan Hubicka [mailto:hubi...@ucw.cz]
 Sent: Thursday, October 24, 2013 2:54 PM
 To: Gopalasubramanian, Ganesh
 Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); 
 hubi...@ucw.cz; H.J. Lu (hjl.to...@gmail.com)
 Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for 
 modern x86 chips
 
  Attached is the patch which does the following scheduler related changes.
  * re-models bdver3 decoder.
  * It enables lookahead with value 8 for all BD architectures. The patch 
  doesn't consider if reloading is completed or not (an area that needs to be 
  worked on).
  * The issue rate for BD architectures are set to 4.
  
  I see the following performance improvements on bdver3 machine.
  * GemsFDTD improves by 6-7% with lookahead value changed to 8.
  * Hmmer improves by 9% when issue rate when set to 4 .
 
 Is this with -fschedule-insns? Or only with default settings?  Did you test 
 the compile time implications of increasing the lookahead? (value of 8 is 
 very large, we may consider enbling it only for -Ofast, limiting for 
 postreload only or something similar).
 
  
  I have considered the following hardware details for the model.
  * There are four decoders inside a hardware decoder block.
  * These four independent decoders can execute in parallel.  (They can take 
  8B from four different instructions and decode).
  * These four decoders are pipelined 4 cycles deep and are non-stalling

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-24 Thread Gopalasubramanian, Ganesh
Attached is the patch which does the following scheduler related changes.
* re-models bdver3 decoder.
* It enables lookahead with value 8 for all BD architectures. The patch doesn't 
consider if reloading is completed or not (an area that needs to be worked on).
* The issue rate for BD architectures are set to 4.

I see the following performance improvements on bdver3 machine.
* GemsFDTD improves by 6-7% with lookahead value changed to 8.
* Hmmer improves by 9% when issue rate when set to 4 .

I have considered the following hardware details for the model.
* There are four decoders inside a hardware decoder block.
* These four independent decoders can execute in parallel.  (They can take 8B 
from four different instructions and decode).
* These four decoders are pipelined 4 cycles deep and are non-stalling.
* Each decoder takes 8B of instruction data every cycle and tries decoding it. 
* Issue rate is 4.

Is it OK for upstream?

Changelog

2013-10-24  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

* config/i386/bdver3.md : Added two additional decoder units 
to support issue rate of 4 and remodeled vector unit.

* config/i386/i386.c (ix86_issue_rate): Issue rate for BD
architectures is set to 4.

* config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA
lookahead is set to 8 for BD architectures.

Regards
Ganesh



bdver3_issue_rate_lookahead.patch
Description: bdver3_issue_rate_lookahead.patch


RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-24 Thread Gopalasubramanian, Ganesh
Hi, 

 Is this with -fschedule-insns? Or only with default settings?  Did you test 
 the compile time implications of increasing the lookahead? (value of 8 is 
 very large, we may consider enbling it only for -Ofast, limiting for 
 postreload only or something similar).

The improvement is seen with the options -fschedule-insns  -fschedule-insns2 
-fsched-pressure

Below are the build times of some of the SPEC benchmarks

  dfa8   no_lookahead

perlbench   - 196s  193s
bzip2   - 19s   19s
gcc - 439s  429s
mcf - 3s3s
gobmk   - 119s  115s
hmmer   - 62s   60s
sjeng   - 18s   17s
libquantum  - 6s6s
h264ref - 110s  107s
omnetpp - 132s  128s
astar   - 7s7s
bwaves  - 4s5s
gamess  - 1996s 1957s
milc- 18s   18s
GemsFDTD- 276s  272s

I think we can enable it by default rather than for -Ofast.
Please let me know your inputs.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Thursday, October 24, 2013 2:54 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); hubi...@ucw.cz; 
H.J. Lu (hjl.to...@gmail.com)
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

 Attached is the patch which does the following scheduler related changes.
 * re-models bdver3 decoder.
 * It enables lookahead with value 8 for all BD architectures. The patch 
 doesn't consider if reloading is completed or not (an area that needs to be 
 worked on).
 * The issue rate for BD architectures are set to 4.
 
 I see the following performance improvements on bdver3 machine.
 * GemsFDTD improves by 6-7% with lookahead value changed to 8.
 * Hmmer improves by 9% when issue rate when set to 4 .

Is this with -fschedule-insns? Or only with default settings?  Did you test the 
compile time implications of increasing the lookahead? (value of 8 is very 
large, we may consider enbling it only for -Ofast, limiting for postreload only 
or something similar).

 
 I have considered the following hardware details for the model.
 * There are four decoders inside a hardware decoder block.
 * These four independent decoders can execute in parallel.  (They can take 8B 
 from four different instructions and decode).
 * These four decoders are pipelined 4 cycles deep and are non-stalling.
 * Each decoder takes 8B of instruction data every cycle and tries decoding 
 it. 
 * Issue rate is 4.
What is the overall limitation on number of bytes the instructions can occupy?
I think they need to fit into 2 16 byte windows, right?
In that case we may want to tweak the existing corei7 scheduling code to take 
care of this.  Making scheduler not overly optimistic about the parallelism is 
good since it will make less register pressure during the first pass.
 
 Is it OK for upstream?

Otherwise the patch seems OK, but I would like to know the compile time effect 
first.

Honza
 
 Changelog
 
 2013-10-24  Ganesh Gopalasubramanian  
 ganesh.gopalasubraman...@amd.com
 
   * config/i386/bdver3.md : Added two additional decoder units 
   to support issue rate of 4 and remodeled vector unit.
 
   * config/i386/i386.c (ix86_issue_rate): Issue rate for BD
   architectures is set to 4.
 
   * config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA
   lookahead is set to 8 for BD architectures.
 
 Regards
 Ganesh
 






[PATCH,i386] Enable FMA4 for AMD bdver3

2013-10-16 Thread Gopalasubramanian, Ganesh
Hi 

The below patch enables FMA4 for AMD bdver3 architectures.

make -k check passes.

Is it OK for upstream?

Regards
Ganesh


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index fb5b267..cbb5311 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-16 Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
+
+   * config/i386/i386.c (ix86_option_override_internal): Enable FMA4
+   for AMD bdver3.
+
 2013-10-16  Hans-Peter Nilsson  h...@axis.com

* config/cris/t-elfmulti (MULTILIB_OPTIONS, MULTILIB_DIRNAMES)
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index b5796db..c24ce36 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -3104,7 +3104,7 @@ ix86_option_override_internal (bool main_args_p,
   {bdver3, PROCESSOR_BDVER3, CPU_BDVER3,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
+   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
| PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE
| PTA_XSAVEOPT | PTA_FSGSBASE},



RE: [PATCH,i386] Enable FMA4 for AMD bdver3

2013-10-16 Thread Gopalasubramanian, Ganesh
 4.8.2 is already rolling, so too late for that.

Is 4.8 branch (gcc/branches/gcc-4_8-branch) open?
If yes, shall I commit these changes?

Regards
Ganesh

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Wednesday, October 16, 2013 12:41 PM
To: Uros Bizjak
Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Enable FMA4 for AMD bdver3

On Wed, Oct 16, 2013 at 09:00:58AM +0200, Uros Bizjak wrote:
 On Wed, Oct 16, 2013 at 8:28 AM, Gopalasubramanian, Ganesh 
 ganesh.gopalasubraman...@amd.com wrote:
 
  The below patch enables FMA4 for AMD bdver3 architectures.
 
  make -k check passes.
 
  +2013-10-16 Ganesh Gopalasubramanian  
  +ganesh.gopalasubraman...@amd.com
  +
  +   * config/i386/i386.c (ix86_option_override_internal): Enable FMA4
  +   for AMD bdver3.
  +
 
 OK for mainline and 4.8 branch (for 4.8.2 if approved by Jakub, 
 otherwise please wait for branch to open).

4.8.2 is already rolling, so too late for that.

Jakub




RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-11 Thread Gopalasubramanian, Ganesh
Hi Honza!

 I will give it a try.
I understand how the proposed patch would be. That is OK to me.

 OK, my undertanding always was, that the decoders works sort of sequentially.

 One decoder of bdver1 can do
 vector,
 double, single,
 single, single, signle

where decoder 1 is somehow special but hardware is able to swap first and 
second single.  Now if two decoders run, i would expect

 vector, vector
 single, single, signle
 double, single, double, single
 
 to be decoded in once cycle

My understanding on the decode unit is mentioned below. Please correct me if I 
am wrong.

The sequential allotment of decoders is not there for bdver1.
Intel Sandybridge\core2 have four decoders. The first decoder is special for 
intel processors. 
For ex, In Sandybridge, the instructions that have only one µop can be decoded 
by any of the four decoders.
Instructions that have up to four µops will be decoded by the first decoder (of 
the four decoders available) only.

Bdver1 has four decoders. None of them is special unlike intel processors.
For bdver1, microcoded instructions are single issue. All four decoders are 
engaged for decoding microcoded instructions.

Decode unit of bdver3 has following specifications
* Four independent decoders which can execute in parallel
* Microcoded instructions are single issue. (All four decoders are engaged). 
This means that only one vectorpath instruction get issued in one cycle.
* The additional hardware instruction decoder increases the instruction decode 
capacity to eight instructions per clock cycle.
* The decoders are pipelined 4 cycles deep and are non-stalling.

So modeling vectorpath instructions is straightforward. No instructions are 
issued along with vector instructions.
We need to model only fastpath single and fastpath double instructions.
There are four decoders and they can execute in parallel. So they can take 
either two double or four single instructions.
We also don't need to model them in two stage as there is no sequence involved.
So, the modeling can be done such that in one cycle we schedule  2 singles + 1 
double (or) 4 singles (or) 2 doubles.

I have tried to model this for bdver3 (code changes are mentioned below). 
Please let me know your opinion.

Regards
Ganesh

Patch
-
diff --git a/gcc/config/i386/bdver3.md b/gcc/config/i386/bdver3.md
index 52418b5..9e59395 100644
--- a/gcc/config/i386/bdver3.md
+++ b/gcc/config/i386/bdver3.md
@@ -34,6 +34,8 @@

 (define_cpu_unit bdver3-decode0 bdver3)
 (define_cpu_unit bdver3-decode1 bdver3)
+(define_cpu_unit bdver3-decode2 bdver3)
+(define_cpu_unit bdver3-decode3 bdver3)
 (define_cpu_unit bdver3-decodev bdver3)

 ;; Double decoded instructions take two cycles whereas
@@ -42,12 +44,15 @@
 ;; two decoders in two cycles.
 ;; Vectorpath instructions are single issue instructions.
 ;; So, we have separate unit for vector instructions.
-(exclusion_set bdver3-decodev bdver3-decode0,bdver3-decode1)
+(exclusion_set bdver3-decodev 
bdver3-decode0,bdver3-decode1,bdver3-decode2,bdver3-decode3)

 (define_reservation bdver3-vector bdver3-decodev)
-(define_reservation bdver3-direct (bdver3-decode0|bdver3-decode1))
+(define_reservation bdver3-direct 
(bdver3-decode0|bdver3-decode1|bdver3-decoder2|bdver3-decoder3))

-(define_reservation bdver3-double (bdver3-decode0|bdver3-decode1)*2)
+(define_reservation bdver3-double (bdver3-decode0+bdver3-decode1)|
+   (bdver3-decode1+bdver3-decode2)|(bdver3-decode2+bdver3-decode3)|
+   (bdver3-decode0+bdver3-decode2)|(bdver3-decode1+bdver3-decode3)|
+   (bdver3-decode0+bdver3-decode3)) 


-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Wednesday, October 09, 2013 7:18 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

 Before merging the insn reservations, I need to compare the latency values 
 for bdver1 and bdver3. I know that they are different for some of the 
 instructions. 
 In that case, the merging should prop up another subset of latency 
 differences. I would like to keep these insn reservations in two .md files 
 (one for bdver1 and one for bdver3) even after the merger.

I am not really insisting on merging (define_insn_reservation bdver3*) with 
(define_insn_reservation bdver1*).  What I have in mind is merging actual 
atuomatons in cases it makes sense.  Latencies are not really encoded in those.

Bdver 12 has:
(define_automaton bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu)
while bdver 3:
(define_automaton bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu)

automatons bdver1 and bdver3 are very different, because one handles up to 3 
instructions, while other handles only 2.  I am still bit confused with this 
every second cycle logic, so lets discuss it incrementally.

I would propose to have
(define_automaton bdver3)
or perhaps
(define_automaton bdver3,bdver3_fp)

now

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-09 Thread Gopalasubramanian, Ganesh
Hi Honza, 

 Yep, I think we need to merge only those autmatas tha are same for both:
 (define_automaton bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu)
 probably can become
 (define_automaton bdver3,bdver3_fp)
 with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu 
 changed to bdver1
 automaton.  I think it should result in smaller binary - the fact that all 
 conditionals are
 physically duplicated in bdver1/bdev3.md should be optimized away by 
 genautomata.

Before merging the insn reservations, I need to compare the latency values for 
bdver1 and bdver3. I know that they are different for some of the instructions. 
In that case, the merging should prop up another subset of latency differences. 
I would like to keep these insn reservations in two .md files (one for bdver1 
and one for bdver3) even after the merger.

 Your version has problem that it does not model the thing that the two 
 decoders works sequentially.

The two stage modeling is required so that the decode unit reservations are 
screened from other unit reservations.
But this sort of goes away in bdver3 because of the decode cycle.
In bdver3, the decode units scan two of these windows every two cycles 
decoding a maximum of eight instructions.
The hardware scan is done every two cycles in bdver3 whereas it is done every 
single cycle in bdver1/bdver2. (But we have two separate hardware decoders 
which guarantees higher throughput)
This means that the two stage modeling is not required in the scheduler 
descriptions since the hardware sort of guarantees that with its scanning 
mechanism. 
Our job is to make sure that 8 direct instructions get scheduled in two cycles 
or 4 double instructions get scheduled in two cycles.
So, I have modeled the bdver3 decoders such that with in a cycle they guarantee 
to issue 4 direct instructions or 2 double instructions. 
This eliminates the sequencing problem in modeling decoders and also ensures 
that the issue rate can be numbered for a single cycle rather than two cycles.
This is one of the reasons why I remodeled only bdver3. Let me know your 
comments on this.

 We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more 
 realistic estimates on what still can be issued - the value of 6 is 
 unrealistically high.
This would get more complicated if we go by decoder capacity in bdver3. As we 
have two hardware decoders in steamroller (bdver3), they have a capacity to 
decode eight instructions per clock cycle, providing up to twice the decode and 
dispatch bandwidth compared to bdver1.
If we model this in GCC we need to change the issue rate to 8. If 6 is high, 
then 8 would add more joy and excitement.

TARGET_SCHED_VARIABLE_ISSUE is a nice suggestion to schedule instructions in 
different way. 

 We also should enable ia32_multipass_dfa_lookahead - with that scheduler 
 should be able to put double decoded and vector decoded insns on the proper 
 places.
Yes. Whenever we have this scheduler analysis in place we discuss about this 
but unfortunately is left as it is.
I will look into this after I do the enablement for bdver4.

 I will work on replacing most of the CPU cases into tuning flags + costs.
I am planning to get bdver4 enablement in place once scheduler descriptions for 
bdver3 is done with.
I will have cycles to look into the cost models. Please delegate some tasks if 
you can and I am willing to take them up.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Tuesday, October 08, 2013 3:20 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

 Hi Honza,
 
 I am planning to update the scheduler descriptions for bdver3 first.
 Attached is the patch. Please let me know your comments if any.
 
 Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and 
 decoding schemes are different. So, let me know how can I approach merging 
 these.

Yep, I think we need to merge only those autmatas tha are same for both:
(define_automaton bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu)
probably can become
(define_automaton bdver3,bdver3_fp)
with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu 
changed to bdver1
automaton.  I think it should result in smaller binary - the fact that all 
conditionals are
physically duplicated in bdver1/bdev3.md should be optimized away by 
genautomata.

I also played a bit with the decoders and I am attaching my version - that 
seems SPEC neutral though.
Your version has problem that it does not model the thing that the two decoders 
works sequentially.

I removed the bdver1-decodev unit and instead i simply reserve all thre 
decoders + I added 
presence set requring second decoder to be taken only after first one changed 
presence set requring
decoder 2 to be taken only after decoder 1+2 to final presence set, so

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-08 Thread Gopalasubramanian, Ganesh
Hi Honza,

I am planning to update the scheduler descriptions for bdver3 first.
Attached is the patch. Please let me know your comments if any.

Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and 
decoding schemes are different. So, let me know how can I approach merging 
these.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Monday, September 30, 2013 4:47 PM
To: gcc-patches@gcc.gnu.org; Gopalasubramanian, Ganesh; hjl.to...@gmail.com
Subject: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips


Hi,
while looking into schedules produced for Buldozer and Core I noticed that they 
do not seem to match reality.  This is because ix86_issue_rate limits those 
CPUs into 3 instructions per cycle, while they are designed to do 4 and 
somewhat confused ix86_adjust_cost.

I also added stack engine into modern chips even though scheduler doesn't 
really understand that multiple push operations can happen in one cycle. At 
least it gets the stack updates in sequences of push/pop operations.

I did not updated buldozer issue rates yet.  The current scheduler model won't 
allow it to execute more than 3 instructions per cycle (and 2 for version 3).  
I think bdver1.md/bdver3.md needs to be updated first.

I am testing x86_64-linux and will commit if there are no complains.

Honza

* i386.c (ix86_issue_rate): Pentium4/Nocona issue 2 instructions
per cycle, Core/CoreI7/Haswell 4 instructions per cycle.
(ix86_adjust_cost): Add stack engine to modern AMD chips;
fix for core; remove Atom that mistakely shared code with AMD.
Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 203011)
+++ config/i386/i386.c  (working copy)
@@ -24435,17 +24435,14 @@ ix86_issue_rate (void)
 case PROCESSOR_SLM:
 case PROCESSOR_K6:
 case PROCESSOR_BTVER2:
+case PROCESSOR_PENTIUM4:
+case PROCESSOR_NOCONA:
   return 2;
 
 case PROCESSOR_PENTIUMPRO:
-case PROCESSOR_PENTIUM4:
-case PROCESSOR_CORE2:
-case PROCESSOR_COREI7:
-case PROCESSOR_HASWELL:
 case PROCESSOR_ATHLON:
 case PROCESSOR_K8:
 case PROCESSOR_AMDFAM10:
-case PROCESSOR_NOCONA:
 case PROCESSOR_GENERIC:
 case PROCESSOR_BDVER1:
 case PROCESSOR_BDVER2:
@@ -24453,6 +24450,11 @@ ix86_issue_rate (void)
 case PROCESSOR_BTVER1:
   return 3;
 
+case PROCESSOR_CORE2:
+case PROCESSOR_COREI7:
+case PROCESSOR_HASWELL:
+  return 4;
+
 default:
   return 1;
 }
@@ -24709,10 +24711,15 @@ ix86_adjust_cost (rtx insn, rtx link, rt
 case PROCESSOR_BDVER3:
 case PROCESSOR_BTVER1:
 case PROCESSOR_BTVER2:
-case PROCESSOR_ATOM:
 case PROCESSOR_GENERIC:
   memory = get_attr_memory (insn);
 
+  /* Stack engine allows to execute pushpop instructions in parall.  */
+  if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
+   (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP))
+  (ix86_tune != PROCESSOR_ATHLON  ix86_tune != PROCESSOR_K8))
+   return 0;
+
   /* Show ability of reorder buffer to hide latency of load by executing
 in parallel with previous instruction in case
 previous instruction is not needed to compute the address.  */ @@ 
-24737,6 +24744,29 @@ ix86_adjust_cost (rtx insn, rtx link, rt
  else
cost = 0;
}
+  break;
+
+case PROCESSOR_CORE2:
+case PROCESSOR_COREI7:
+case PROCESSOR_HASWELL:
+  memory = get_attr_memory (insn);
+
+  /* Stack engine allows to execute pushpop instructions in parall.  */
+  if ((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
+  (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP))
+   return 0;
+
+  /* Show ability of reorder buffer to hide latency of load by executing
+in parallel with previous instruction in case
+previous instruction is not needed to compute the address.  */
+  if ((memory == MEMORY_LOAD || memory == MEMORY_BOTH)
+  !ix86_agi_dependent (dep_insn, insn))
+   {
+ if (cost = 4)
+   cost -= 4;
+ else
+   cost = 0;
+   }
   break;
 
 case PROCESSOR_SLM:


issue_rate_bdver3.patch
Description: issue_rate_bdver3.patch


RE: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-30 Thread Gopalasubramanian, Ganesh
 1. For cmp/test with rip-relative addressing mem operand, don't group 
 insns. Bulldozer also doesn't support fusion for cmp/test with both 
 displacement MEM and immediate operand, while m_CORE_ALL doesn't 
 support fusion for cmp/test with MEM and immediate operand. I simplify 
 choose to use the more stringent constraint here (m_CORE_ALL's 
 constraint).

This suits Bulldozer's specification. We don't see an issue with the proposed 
patch.

Regards
Ganesh

-Original Message-
From: H.J. Lu [mailto:hjl.to...@gmail.com] 
Sent: Wednesday, September 25, 2013 2:12 AM
To: Wei Mi
Cc: Jan Hubicka; Alexander Monakov; Steven Bosscher; GCC Patches; David Li; 
Kirill Yukhin
Subject: Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

On Tue, Sep 24, 2013 at 12:06 PM, Wei Mi w...@google.com wrote:
 This is the updated patch2.
 Changed:
 1. For cmp/test with rip-relative addressing mem operand, don't group 
 insns. Bulldozer also doesn't support fusion for cmp/test with both 
 displacement MEM and immediate operand, while m_CORE_ALL doesn't 
 support fusion for cmp/test with MEM and immediate operand. I simplify 
 choose to use the more stringent constraint here (m_CORE_ALL's 
 constraint).
 2. Add Budozer back and merge TARGET_FUSE_CMP_AND_BRANCH_64 and 
 TARGET_FUSE_CMP_AND_BRANCH_32.

 bootstrap and regression pass. ok for trunk?

 2013-09-24  Wei Mi  w...@google.com

 * gcc/config/i386/i386.c (rip_relative_addr_p): New Function.
 (ix86_macro_fusion_p): Ditto.
 (ix86_macro_fusion_pair_p): Ditto.
 * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
 * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
 * gcc/doc/tm.texi: Generated.
 * gcc/doc/tm.texi.in: Ditto.
 * gcc/haifa-sched.c (try_group_insn): New Function.
 (group_insns_for_macro_fusion): Ditto.
 (sched_init): Call group_insns_for_macro_fusion.
 * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
 a SCHED_GROUP at the end of BB to remain their location.
 * gcc/target.def: Add two hooks: macro_fusion_p and
 macro_fusion_pair_p.

 diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 
 1fd3f60..4a04778 100644
 --- a/gcc/config/i386/i386.c
 +++ b/gcc/config/i386/i386.c
 @@ -24856,6 +24856,167 @@ ia32_multipass_dfa_lookahead (void)
  }
  }

 +/* Extracted from ix86_print_operand_address. Check whether ADDR is a
 +   rip-relative address.  */
 +
 +static bool
 +rip_relative_addr_p (rtx addr)
 +{
 +  struct ix86_address parts;
 +  rtx base, index, disp;
 +  int ok;
 +
 +  if (GET_CODE (addr) == UNSPEC  XINT (addr, 1) == UNSPEC_VSIBADDR)
 +{
 +  ok = ix86_decompose_address (XVECEXP (addr, 0, 0), parts);
 +  parts.index = XVECEXP (addr, 0, 1);
 +}
 +  else if (GET_CODE (addr) == UNSPEC  XINT (addr, 1) == UNSPEC_LEA_ADDR)
 +ok = ix86_decompose_address (XVECEXP (addr, 0, 0), parts);  else
 +ok = ix86_decompose_address (addr, parts);
 +
 +  gcc_assert (ok);
 +  base = parts.base;
 +  index = parts.index;
 +  disp = parts.disp;
 +
 +  if (TARGET_64BIT  !base  !index)
 +{
 +  rtx symbol = disp;
 +
 +  if (GET_CODE (disp) == CONST
 +  GET_CODE (XEXP (disp, 0)) == PLUS
 +  CONST_INT_P (XEXP (XEXP (disp, 0), 1)))
 +   symbol = XEXP (XEXP (disp, 0), 0);
 +
 +  if (GET_CODE (symbol) == LABEL_REF
 + || (GET_CODE (symbol) == SYMBOL_REF
 +  SYMBOL_REF_TLS_MODEL (symbol) == 0))
 +   return true;
 +}
 +  if (flag_pic  !base  !index)
 +{
 +  if (GET_CODE (disp) == CONST
 +  GET_CODE (XEXP (disp, 0)) == UNSPEC
 +  (XINT (XEXP (disp, 0), 1) == UNSPEC_PCREL
 + || XINT (XEXP (disp, 0), 1) == UNSPEC_GOTPCREL
 + || (TARGET_64BIT
 +  XINT (XEXP (disp, 0), 1) == UNSPEC_GOTNTPOFF)))
 +   return true;
 +}
 +  return false;
 +}
 +

It doesn't look right.  IP relative address is only possible with TARGET_64BIT 
and

1. base == pc. Or
2. UUNSPEC_PCREL,  UNSPEC_GOTPCREL, and
NSPEC_GOTNTPOFF.


--
H.J.



RE: [PATCH,i386] Default alignment for AMD BD and BT

2013-08-01 Thread Gopalasubramanian, Ganesh
Thanks Jakub!

Committed revision 201402.

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Thursday, July 04, 2013 4:46 PM
To: Gopalasubramanian, Ganesh
Cc: Uros Bizjak (ubiz...@gmail.com); gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT

On Thu, Jul 04, 2013 at 11:14:24AM +, Gopalasubramanian, Ganesh wrote:
 Can this be backported now!

Yes.

Jakub




RE: [PATCH,i386] Default alignment for AMD BD and BT

2013-07-04 Thread Gopalasubramanian, Ganesh
Hi Uros,

Can this be backported now!

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Thursday, May 30, 2013 1:40 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT

On Wed, May 29, 2013 at 1:28 PM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 We want this to be backported to GCC48 branch.
 Please approve.

 -Original Message-
 From: Uros Bizjak [mailto:ubiz...@gmail.com]
 Sent: Tuesday, May 07, 2013 6:22 PM
 To: Gopalasubramanian, Ganesh
 Cc: gcc-patches@gcc.gnu.org
 Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT

 On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh 
 ganesh.gopalasubraman...@amd.com wrote:

 The patch updates the alignment values for AMD BD and BT architectures.

 make -k check passes.

 Is it OK for upstream?

 2013-05-07  Ganesh Gopalasubramanian
 ganesh.gopalasubraman...@amd.com

 * config/i386/i386.c (processor_target_table): Modified default
 alignment values for AMD BD and BT architectures.

This is OK, but please wait until 4.8 branch is open again.

Thanks,
Uros.




RE: [PATCH,i386] FP Reassociation for AMD bdver1 and bdver2

2013-05-29 Thread Gopalasubramanian, Ganesh
Thanks Uros! Committed at r199405.

-Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Thursday, May 23, 2013 4:47 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] FP Reassociation for AMD bdver1 and bdver2

On Thu, May 23, 2013 at 1:11 PM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 The patch enables FP Reassociation pass AMD bdver1 and bdver2 architectures.
 We note a performance uplift of around ~8% on calculix.

 make -k check passes.

 Is it OK for upstream?

OK.

Thanks,
Uros.




RE: [PATCH,i386] Default alignment for AMD BD and BT

2013-05-29 Thread Gopalasubramanian, Ganesh
Hi

We want this to be backported to GCC48 branch.
Please approve.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Tuesday, May 07, 2013 6:22 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT

On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 The patch updates the alignment values for AMD BD and BT architectures.

 make -k check passes.

 Is it OK for upstream?

 2013-05-07  Ganesh Gopalasubramanian  
 ganesh.gopalasubraman...@amd.com

 * config/i386/i386.c (processor_target_table): Modified default
 alignment values for AMD BD and BT architectures.

The value 11 indeed looks a bit weird, but it means: align to 16 byte boundary 
only if this can be done by skipping 10 bytes or less.

Tha patch is OK for mainline.

Thanks,
Uros.




RE: [PATCH, i386]: Update processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags

2013-05-16 Thread Gopalasubramanian, Ganesh
Thank you Uros for the patch. Could you backport this to the 4.8.0?

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Wednesday, May 15, 2013 11:16 PM
To: gcc-patches@gcc.gnu.org
Cc: Gopalasubramanian, Ganesh
Subject: [PATCH, i386]: Update processor_alias_table for missing PTA_PRFCHW and 
PTA_FXSR flags

Hello!

Attached patch adds missing PTA_PRFCHW and PTA_FXSR flags to x86 processor 
alias table. PRFCHW CPUID flag is shared with 3dnow prefetch flag, so some 
additional logic is needed to avoid generating SSE prefetches for non-SSE 
3dNow! targets, while still generating full set of 3dnow prefetches on 3dNow! 
targets.

2013-05-15  Uros Bizjak  ubiz...@gmail.com

* config/i386/i386.c (iy86_option_override_internal): Update
processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags.  Add
PTA_POPCNT to corei7 entry and remove PTA_SSE from athlon-4 entry.
Do not enable SSE prefetch on non-SSE 3dNow! targets.  Enable
TARGET_PRFCHW for TARGET_3DNOW targets.
* config/i386/i386.md (prefetch): Enable for TARGET_PRFCHW instead
of TARGET_3DNOW.
(*prefetch_3dnow): Enable for TARGET_PRFCHW only.

Patch was bootstrapped and regression tested on x86_64-pc-linux-gnu {,-m32}  
and was committed to mainline SVN. The patch will be backported to 4.8 branch 
in a couple of days.

Uros.



RE: [PATCH,i386] FSGSBASE for AMD bdver3

2013-05-15 Thread Gopalasubramanian, Ganesh
Thank you Uros!

Patch for FSGSBASE instruction generation for AMD bdver3 committed to trunk 
(rr198916).

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Monday, May 13, 2013 5:50 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] FSGSBASE for AMD bdver3

On Mon, May 13, 2013 at 1:54 PM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 The patch enables FSGSBASE instruction generation for AMD bdver3 
 architectures.

 make -k check passes.

 Is it OK for upstream?

OK. Please also check for missing PTA_PRFCHW and PTA_FXSR for AMD processors in 
processor_alias_table.

Thanks,
Uros.




RE: [PATCH,i386] FSGSBASE for AMD bdver3

2013-05-14 Thread Gopalasubramanian, Ganesh
Thanks Uros!

I think you mean the amdfam10 ISA mismatch between march=native and 
march=amdfam10.
The below patch fills the gap.

make -k check passes.

Regards
Ganesh

2013-05-07  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

* config/i386/i386.c (processor_alias_table): Mismatch in ISAs
Between march=native and march=amdfam10 is fixed.

--- ./wkcpy/gcc-4.9.0/gcc/config/i386/i386.c  2013-02-21 16:27:10.0 
+0530
+++ ./source/gcc-4.9.0/gcc/config/i386/i386.c   2013-10-21 22:20:28.0 
+0530
@@ -2964,7 +2964,8 @@
| PTA_SSE2 | PTA_NO_SAHF},
   {amdfam10, PROCESSOR_AMDFAM10, CPU_AMDFAM10,
PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE
-   | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM},
+   | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM
+   | PTA_FXSR | PTA_PRFCHW},
   {barcelona, PROCESSOR_AMDFAM10, CPU_AMDFAM10,
PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE
| PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM},


-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Monday, May 13, 2013 5:50 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] FSGSBASE for AMD bdver3

On Mon, May 13, 2013 at 1:54 PM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 The patch enables FSGSBASE instruction generation for AMD bdver3 
 architectures.

 make -k check passes.

 Is it OK for upstream?

OK. Please also check for missing PTA_PRFCHW and PTA_FXSR for AMD processors in 
processor_alias_table.

Thanks,
Uros.




RE: [PATCH,i386] Default alignment for AMD BD and BT

2013-05-13 Thread Gopalasubramanian, Ganesh
Thank you Uros!

Committed r198820. 

Regards
Ganesh


-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Tuesday, May 07, 2013 6:22 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT

On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 The patch updates the alignment values for AMD BD and BT architectures.

 make -k check passes.

 Is it OK for upstream?

 2013-05-07  Ganesh Gopalasubramanian  
 ganesh.gopalasubraman...@amd.com

 * config/i386/i386.c (processor_target_table): Modified default
 alignment values for AMD BD and BT architectures.

The value 11 indeed looks a bit weird, but it means: align to 16 byte boundary 
only if this can be done by skipping 10 bytes or less.

Tha patch is OK for mainline.

Thanks,
Uros.




[PATCH,i386] FSGSBASE for AMD bdver3

2013-05-13 Thread Gopalasubramanian, Ganesh
Hi 

The patch enables FSGSBASE instruction generation for AMD bdver3 architectures.

make -k check passes.

Is it OK for upstream?

Regards
Ganesh

Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 198821)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2013-05-13  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
+
+* config/i386/i386.c (processor_alias_table): Add instruction
+FSGSBASE for AMD bdver3 architecture.
+
 2013-05-13  Martin Jambor  mjam...@suse.cz

PR middle-end/42371
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 198821)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3000,7 +3000,7 @@
| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
| PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE
-   | PTA_XSAVEOPT},
+   | PTA_XSAVEOPT | PTA_FSGSBASE},
   {btver1, PROCESSOR_BTVER1, CPU_GENERIC64,
PTA_64BIT | PTA_MMX |  PTA_SSE  | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW



[PATCH,i386] Default alignment for AMD BD and BT

2013-05-07 Thread Gopalasubramanian, Ganesh
Hi 

The patch updates the alignment values for AMD BD and BT architectures.

make -k check passes.

Is it OK for upstream?

Regards
Ganesh

2013-05-07  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

* config/i386/i386.c (processor_target_table): Modified default 
alignment values for AMD BD and BT architectures.

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 198386)
+++ gcc/config/i386/i386.c  (working copy)
@@ -2450,11 +2450,11 @@
   {generic32_cost, 16, 7, 16, 7, 16},
   {generic64_cost, 16, 10, 16, 10, 16},
   {amdfam10_cost, 32, 24, 32, 7, 32},
-  {bdver1_cost, 32, 24, 32, 7, 32},
-  {bdver2_cost, 32, 24, 32, 7, 32},
-  {bdver3_cost, 32, 24, 32, 7, 32},
-  {btver1_cost, 32, 24, 32, 7, 32},
-  {btver2_cost, 32, 24, 32, 7, 32},
+  {bdver1_cost, 16, 10, 16, 7, 11},
+  {bdver2_cost, 16, 10, 16, 7, 11},
+  {bdver3_cost, 16, 10, 16, 7, 11},
+  {btver1_cost, 16, 10, 16, 7, 11},
+  {btver2_cost, 16, 10, 16, 7, 11},
   {atom_cost, 16, 15, 16, 7, 16}
 };



RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores

2013-02-14 Thread Gopalasubramanian, Ganesh
Thank you Gerald!

Committed with the changes. 

Regards
Ganesh

-Original Message-
From: Gerald Pfeifer [mailto:ger...@pfeifer.com] 
Sent: Thursday, February 14, 2013 2:40 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patchesUros Bizjak
Subject: RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores

On Thu, 14 Feb 2013, Gopalasubramanian, Ganesh wrote:
 Is it OK for wwdocs?

Looks good to me if you say ...through the... options (adding the in two 
cases) and breaking the lines to not exceed 76 columns.

Thanks,
Gerald




[patch][wwwdocs] gcc 4.8 changes - AMD new cores

2013-02-13 Thread Gopalasubramanian, Ganesh
Hello, 

This patch adds short words about the new AMD cores that got enabled in GCC-4.8.
OK for the wwwdocs? 

Regards
Ganesh

Index: gcc-4.8/changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v
retrieving revision 1.96
diff -u -r1.96 changes.html
--- gcc-4.8/changes.html12 Feb 2013 16:33:58 -  1.96
+++ gcc-4.8/changes.html13 Feb 2013 08:24:53 -
@@ -529,6 +529,10 @@
 information.
 /li
 li Windows MinGW-w64 targets (code*-w64-mingw*/code) require at 
least r5437 from the Mingw-w64 trunk. /li
+liSupport for new AMD family 15h processors (Steamroller core) is now 
available
+  through code-march=bdver3/code and code-mtune=bdver3/code 
options./li
+liSupport for new AMD family 16h processors (Jaguar core) is now 
available
+  through code-march=btver2/code and code-mtune=btver2/code 
options./li
   /ul

 h3 id=frvFRV/h3



RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores

2013-02-13 Thread Gopalasubramanian, Ganesh
Is it OK for wwdocs?

Index: gcc-4.8/changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v
retrieving revision 1.96
diff -u -r1.96 changes.html
--- gcc-4.8/changes.html12 Feb 2013 16:33:58 -  1.96
+++ gcc-4.8/changes.html13 Feb 2013 08:24:53 -
@@ -529,6 +529,10 @@
 information.
 /li
 li Windows MinGW-w64 targets (code*-w64-mingw*/code) require at 
least r5437 from the Mingw-w64 trunk. /li
+liSupport for new AMD family 15h processors (Steamroller core) is now 
available
+  through code-march=bdver3/code and code-mtune=bdver3/code 
options./li
+liSupport for new AMD family 16h processors (Jaguar core) is now 
available
+  through code-march=btver2/code and code-mtune=btver2/code 
options./li
   /ul

 h3 id=frvFRV/h3

-Original Message-
From: Mikael Morin [mailto:mikael.mo...@sfr.fr] 
Sent: Wednesday, February 13, 2013 6:38 PM
To: Richard Biener
Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org; ubizjak at gmail dot 
com (gcc-bugzi...@gcc.gnu.org); ger...@pfeifer.com
Subject: Re: [patch][wwwdocs] gcc 4.8 changes - AMD new cores

Le 13/02/2013 14:00, Richard Biener a écrit :
 Of course not.  Next they'll add blver ...
Sorry




RE: [PATCH, i386]: AMD bdver3 enablement

2012-11-15 Thread Gopalasubramanian, Ganesh
Thank Uros for the comments.

The changes are committed to trunk
http://gcc.gnu.org/viewcvs?view=revisionrevision=193548
http://gcc.gnu.org/viewcvs?view=revisionrevision=193549

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Wednesday, November 14, 2012 4:15 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, i386]: AMD bdver3 enablement

On Wed, Nov 14, 2012 at 10:22 AM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 sseshuf replaces sselog in some insn patterns, but should be handled in the 
 same way in *existing* .md files.

 Modifications done as per the comments.
 1. Sseshuf is added along with sselog in existing md files.
 2. sseshuf is handled in a separate pattern in bdver3.md

 Bootstrapping and make -k check passes.
 Ok for trunk?

 2012-11-14  Ganesh Gopalasubramanian  
 ganesh.gopalasubraman...@amd.com

 bdver3 Enablement
 * gcc/doc/extend.texi: Add details about bdver3.
 * gcc/doc/invoke.texi: Add details about bdver3.
 * config.gcc (i[34567]86-*-linux* | ...): Add bdver3.
 (case ${target}): Add bdver3.
 * config/i386/i386.h (TARGET_BDVER3): New definition.
 * config/i386/i386.md (define_attr cpu): Add bdver3.
 * config/i386/sse.md (sseshuf): New type attribute.
 * config/i386/athlon.md (sseshuf):Likewise.
 * config/i386/atom.md (sseshuf):Likewise.
 * config/i386/ppro.md (sseshuf):Likewise.
 * config/i386/bdver1.md (sseshuf):Likewise.
 * config/i386/i386.opt (flag_dispatch_scheduler): Add bdver3.
 * config/i386/i386-c.c (ix86_target_macros_internal): Add
 bdver3 def_and_undef
 * config/i386/driver-i386.c (host_detect_local_cpu): Let
 -march=native recognize bdver3 processors.
 * config/i386/i386.c (struct processor_costs bdver3_cost): New.
 (m_BDVER3): New definition.
 (m_AMD_MULTIPLE): Includes m_BDVER3.
 (initial_ix86_tune_features): Add bdver3 tune.
 (processor_target_table): Add bdver3 entry.
 (static const char *const cpu_names): Add bdver3 entry.
 (software_prefetching_beneficial_p): Add bdver3.
 (ix86_option_override_internal): Add bdver3 instruction sets.
 (ix86_option_override_internal): Remove XSAVEOPT for bdver1
 and bdver2.
 (ix86_issue_rate): Add bdver3.
 (ix86_adjust_cost): Add bdver3.
 (enum target_cpu_default): Add TARGET_CPU_DEFAULT_bdver3.
 (enum processor_type): Add PROCESSOR_BDVER3.
 * config/i386/bdver3.md: New file describing bdver3 pipelines.

OK for mainline.

Thanks,
Uros.




RE: [PATCH, i386]: AMD bdver3 enablement

2012-11-11 Thread Gopalasubramanian, Ganesh
 You can see from the changes of sse.md that this is functionally a no-op 
 change.
Sseshuf replaces sselog.
So, do you mean it should be added with sselog instead of sseadd?
Adding it with sseadd (instead of sselog) influences the latency information.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Monday, November 12, 2012 2:30 AM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, i386]: AMD bdver3 enablement

On Fri, Nov 9, 2012 at 4:39 AM, Gopalasubramanian, Ganesh
ganesh.gopalasubraman...@amd.com wrote:

 Changes done with respect to the review comments.
 Conditionally setting sseshuf type attribute has been removed.
 Instead new attribute is added and is included for other attribute 
 calculations.

 The patch is attached as (difflog.txt).
 The new file (bdver3.md) describing the pipelines is also attached.

 Bootstrapping and make -k check passes.

 OK for upstream?

 2012-11-09  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

 bdver3 Enablement
 * gcc/doc/extend.texi: Add details about bdver3.
 * gcc/doc/invoke.texi: Add details about bdver3.
 * config.gcc (i[34567]86-*-linux* | ...): Add bdver3.
 (case ${target}): Add bdver3.
 * config/i386/i386.h (TARGET_BDVER3): New definition.
 * config/i386/i386.md (define_attr cpu): Add bdver3.
 * config/i386/sse.md (sseshuf): New type attribute.
 * config/i386/athlon.md (sseshuf):Likewise.
 * config/i386/atom.md (sseshuf):Likewise.
 * config/i386/ppro.md (sseshuf):Likewise.

Index: gcc/config/i386/atom.md
===
--- gcc/config/i386/atom.md (revision 193132)
+++ gcc/config/i386/atom.md (working copy)
@@ -455,6 +455,30 @@
 (eq_attr memory !none)))
   atom-simple-0)

+(define_insn_reservation  atom_sseshuf 1
+  (and (eq_attr cpu atom)
+   (and (eq_attr type sseshuf)
+(eq_attr memory none)))
+  atom-simple-either)
+
+(define_insn_reservation  atom_sseshuf_mem 1
+  (and (eq_attr cpu atom)
+   (and (eq_attr type sseshuf)
+(eq_attr memory !none)))
+  atom-simple-either)
+
+(define_insn_reservation  atom_sseshuf1 1
+  (and (eq_attr cpu atom)
+   (and (eq_attr type sseshuf1)
+(eq_attr memory none)))
+  atom-simple-0)
+
+(define_insn_reservation  atom_sseshuf1_mem 1
+  (and (eq_attr cpu atom)
+   (and (eq_attr type sseshuf1)
+(eq_attr memory !none)))
+  atom-simple-0)
+
 ;; not pmad, not psad
 (define_insn_reservation  atom_sseiadd 1
   (and (eq_attr cpu atom)

This was not what I had in mind for changes in existing .md files.
Just change them in this way:

Index: atom.md
===
--- atom.md (revision 193407)
+++ atom.md (working copy)
@@ -594,7 +594,7 @@
 ;; no memory simple
 (define_insn_reservation  atom_sseadd 5
   (and (eq_attr cpu atom)
-   (and (eq_attr type sseadd,sseadd1)
+   (and (eq_attr type sseadd,sseshuf,sseadd1,sseshuf1)
 (and (eq_attr memory none)
  (and (eq_attr mode !V2DF)
   (eq_attr atom_unit !complex)
@@ -603,7 +603,7 @@
 ;; memory simple
 (define_insn_reservation  atom_sseadd_mem 5
   (and (eq_attr cpu atom)
-   (and (eq_attr type sseadd,sseadd1)
+   (and (eq_attr type sseadd,sseshuf,sseadd1,sseshuf1)
 (and (eq_attr memory !none)
  (and (eq_attr mode !V2DF)
   (eq_attr atom_unit !complex)
@@ -612,7 +612,7 @@
 ;; maxps, minps, *pd, hadd, hsub
 (define_insn_reservation  atom_sseadd_3 8
   (and (eq_attr cpu atom)
-   (and (eq_attr type sseadd,sseadd1)
+   (and (eq_attr type sseadd,sseshuf,sseadd1,sseshuf1)
 (ior (eq_attr mode V2DF) (eq_attr atom_unit complex
   atom-complex, atom-all-eu*7)

You can see from the changes of sse.md that this is functionally a no-op change.

Uros.




RE: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon

2012-10-30 Thread Gopalasubramanian, Ganesh
Hi Jakub,

We are working on the following. 
1. bdver3 enablement. Review completed. Changes to be incorporated and 
checked-in.
http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01131.html

2. btver2 basic enablement is done 
(http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01018.html)/
Scheduler descriptions are being updated. This is architecture specific and we 
consider it not to be a stage-1 material.

Regards
Ganesh

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Monday, October 29, 2012 11:27 PM
To: g...@gcc.gnu.org
Cc: gcc-patches@gcc.gnu.org
Subject: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon

Status
==

I'd like to close the stage 1 phase of GCC 4.8 development on Monday, November 
5th.  If you have still patches for new features you'd like to see in GCC 4.8, 
please post them for review soon.  Patches posted before the freeze, but 
reviewed shortly after the freeze, may still go in, further changes should be 
just bugfixes and documentation fixes.


Quality Data


Priority  #   Change from Last Report
---   ---
P1   23   + 23
P2   77   +  8
P3   85   + 84
---   ---
Total   185   +115


Previous Report
===

http://gcc.gnu.org/ml/gcc/2012-03/msg00011.html

The next report will be sent by me again, announcing end of stage 1.



Add myself to MAINTAINERS

2012-10-30 Thread Gopalasubramanian, Ganesh
Adding myself to the list of members in write after approval.

Index: ChangeLog
===
--- ChangeLog   (revision 192977)
+++ ChangeLog   (working copy)
@@ -1,3 +1,7 @@
+2012-10-30 Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
+
+   * MAINTAINERS (Write After Approval): Add myself.
+
 2012-10-26  James Greenhalgh  james.greenha...@arm.com

* MAINTAINERS (Write After Approval): Add myself.
Index: MAINTAINERS
===
--- MAINTAINERS (revision 192977)
+++ MAINTAINERS (working copy)
@@ -372,6 +372,7 @@
 Chao-ying Fu   f...@mips.com
 Gary Funck g...@intrepid.com
 Pompapathi V Gadad pompapathi.v.ga...@nsc.com
+Gopalasubramanian Ganesh   ganesh.gopalasubraman...@amd.com
 Kaveh Ghazigh...@gcc.gnu.org
 Matthew Gingellging...@gnat.com
 Tristan Gingoldging...@adacore.com

Regards
Ganesh



RE: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced

2012-10-10 Thread Gopalasubramanian, Ganesh
That was obvious. Sorry for the wrong commit.

Thanks Jakub.

-Ganesh

-Original Message-
From: Paolo Carlini [mailto:paolo.carl...@oracle.com] 
Sent: Wednesday, October 10, 2012 4:33 PM
To: Jakub Jelinek
Cc: Gopalasubramanian, Ganesh; Uros Bizjak; gcc-patches@gcc.gnu.org; 
veku...@gcc.gnu.org
Subject: Re: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine 
is reduced

On 10/10/2012 01:00 PM, Jakub Jelinek wrote:
 I have removed the extra line as obvious in SVN, to allow my 
 bootstraps to continue.
Thanks!

Paolo.





RE: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced

2012-10-03 Thread Gopalasubramanian, Ganesh
Testing was done before posting the patch. It was successful.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Thursday, September 27, 2012 5:57 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine 
is reduced

On Thu, Sep 27, 2012 at 10:30 AM, Gopalasubramanian, Ganesh 
ganesh.gopalasubraman...@amd.com wrote:

 This is a fix for PR 51109.

 There are three changes

 1.  Microcoded instructions are considered as single issue instructions 
 and are therefore issued to a separate execution unit.
 2.  The multiplier unit is attached to execution unit 1 (ieu1). Since ieu 
 is handled as a separate automaton in the patch, separate mult automaton is 
 not required.
 3.  The integer execution units (2AGUs and 2EXs) are now decoupled. Now, 
 they are described as separate automatons.

 Is it OK for upstream?

 Regards
 Ganesh

 2012-09-27  Ganesh Gopalasubramanian  
 ganesh.gopalasubraman...@amd.com

 PR 51109
 * gcc/config/i386/bdver1.md (bdver1_int): Automaton has been
 split to reduce state transitions.

OK for mainline, if tested according to [1].

[1] http://gcc.gnu.org/contribute.html#testing

Thanks,
Uros.




[PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced

2012-09-27 Thread Gopalasubramanian, Ganesh
Hi All,

This is a fix for PR 51109.

There are three changes

1.  Microcoded instructions are considered as single issue instructions and 
are therefore issued to a separate execution unit.
2.  The multiplier unit is attached to execution unit 1 (ieu1). Since ieu 
is handled as a separate automaton in the patch, separate mult automaton is 
not required.
3.  The integer execution units (2AGUs and 2EXs) are now decoupled. Now, 
they are described as separate automatons.

Is it OK for upstream?

Regards
Ganesh

2012-09-27  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

PR 51109
* gcc/config/i386/bdver1.md (bdver1_int): Automaton has been 
split to reduce state transitions.
   
Index: gcc/config/i386/bdver1.md
===
--- gcc/config/i386/bdver1.md   (revision 191658)
+++ gcc/config/i386/bdver1.md   (working copy)
@@ -36,7 +36,7 @@
 (define_attr bdver1_decode direct,vector,double
   (const_string direct))

-(define_automaton bdver1,bdver1_int,bdver1_load,bdver1_mult,bdver1_fp)
+(define_automaton bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu)

 (define_cpu_unit bdver1-decode0 bdver1)
 (define_cpu_unit bdver1-decode1 bdver1)
@@ -71,16 +71,14 @@
 | (nothing,(bdver1-decode1 + 
bdver1-decode2


-(define_cpu_unit bdver1-ieu0 bdver1_int)
-(define_cpu_unit bdver1-ieu1 bdver1_int)
+(define_cpu_unit bdver1-ieu0 bdver1_ieu)
+(define_cpu_unit bdver1-ieu1 bdver1_ieu)
 (define_reservation bdver1-ieu (bdver1-ieu0 | bdver1-ieu1))

-(define_cpu_unit bdver1-agu0 bdver1_int)
-(define_cpu_unit bdver1-agu1 bdver1_int)
+(define_cpu_unit bdver1-agu0 bdver1_agu)
+(define_cpu_unit bdver1-agu1 bdver1_agu)
 (define_reservation bdver1-agu (bdver1-agu0 | bdver1-agu1))

-(define_cpu_unit bdver1-mult bdver1_mult)
-
 (define_cpu_unit bdver1-load0 bdver1_load)
 (define_cpu_unit bdver1-load1 bdver1_load)
 (define_reservation bdver1-load bdver1-agu,
@@ -93,6 +91,12 @@
 ;; 128bit SSE instructions issue two stores at once.
 (define_reservation bdver1-store2 (bdver1-load0 + bdver1-load1))

+;; vectorpath (microcoded) instructions are single issue instructions.
+;; So, they occupy all the integer units.
+(define_reservation bdver1-ivector bdver1-ieu0+bdver1-ieu1+
+  bdver1-agu0+bdver1-agu1+
+  bdver1-load0+bdver1-load1)
+
 ;; The FP operations start to execute at stage 12 in the pipeline, while
 ;; integer operations start to execute at stage 9 for athlon and 11 for K8
 ;; Compensate the difference for athlon because it results in significantly
@@ -125,7 +129,7 @@
 (define_insn_reservation bdver1_call 0
 (and (eq_attr cpu bdver1,bdver2)
  (eq_attr type call,callv))
-bdver1-double,bdver1-agu,bdver1-ieu)
+bdver1-double,bdver1-agu)
 ;; PUSH mem is double path.
 (define_insn_reservation bdver1_push 1
 (and (eq_attr cpu bdver1,bdver2)
@@ -135,17 +139,17 @@
 (define_insn_reservation bdver1_pop 1
 (and (eq_attr cpu bdver1,bdver2)
  (eq_attr type pop))
-bdver1-direct,(bdver1-ieu+bdver1-load))
+bdver1-direct,bdver1-ivector)
 ;; LEAVE no latency info so far, assume same with amdfam10.
 (define_insn_reservation bdver1_leave 3
 (and (eq_attr cpu bdver1,bdver2)
  (eq_attr type leave))
-bdver1-vector,(bdver1-ieu+bdver1-load))
+bdver1-vector,bdver1-ivector)
 ;; LEA executes in AGU unit with 1 cycle latency on BDVER1.
 (define_insn_reservation bdver1_lea 1
 (and (eq_attr cpu bdver1,bdver2)
  (eq_attr type lea))
-bdver1-direct,bdver1-agu,nothing)
+bdver1-direct,bdver1-agu)

 ;; MUL executes in special multiplier unit attached to IEU1.
 (define_insn_reservation bdver1_imul_DI 6
@@ -153,23 +157,23 @@
  (and (eq_attr type imul)
   (and (eq_attr mode DI)
(eq_attr memory none,unknown
-
bdver1-direct1,bdver1-ieu1,bdver1-mult,nothing,bdver1-ieu1)
+bdver1-direct1,bdver1-ieu1)
 (define_insn_reservation bdver1_imul 4
 (and (eq_attr cpu bdver1,bdver2)
  (and (eq_attr type imul)
   (eq_attr memory none,unknown)))
-bdver1-direct1,bdver1-ieu1,bdver1-mult,bdver1-ieu1)
+bdver1-direct1,bdver1-ieu1)
 (define_insn_reservation bdver1_imul_mem_DI 10
 (and (eq_attr cpu bdver1,bdver2)
  (and (eq_attr type imul)
 

RE: [PATCH,i386] fma4 addition for bdver2

2012-09-09 Thread Gopalasubramanian, Ganesh
Hi,

The second change (done in config/i386/driver-i386.c (host_detect_local_cpu)) 
is not reflected in svn revision 191109.
Since we are handling the fma instruction selection in i386.c\i386.md, we need 
not disable the flag in driver.

Let me know your opinion.

Regards
Ganesh

-Original Message-
From: Gopalasubramanian, Ganesh 
Sent: Wednesday, September 05, 2012 3:41 PM
To: gcc-patches@gcc.gnu.org
Cc: Uros Bizjak (ubiz...@gmail.com)
Subject: [PATCH,i386] fma4 addition for bdver2 

Hello,

FMA4 and FMA3 ISA are implemented in bdver2 target.
FMA3 is selected by default. 
This patch supports the use of FMA4 intrinsics for bdver2 targets.

Is it OK for trunk?

Regards
Ganesh

2012-09-05  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

* config/i386/i386.md : Comments on fma4 instruction 
selection reflect requirement on register pressure based
cost model.
   
* config/i386/driver-i386.c (host_detect_local_cpu): fma4
flag is set-reset as informed by the cpuid flag.

* config/i386/i386.c (processor_alias_table): fma4
flag is enabled for bdver2.

Index: gcc/config/i386/i386.md
===
--- gcc/config/i386/i386.md (revision 190830)
+++ gcc/config/i386/i386.md (working copy)
@@ -659,9 +659,11 @@
 (eq_attr isa noavx2) (symbol_ref !TARGET_AVX2)
 (eq_attr isa bmi2) (symbol_ref TARGET_BMI2)
 (eq_attr isa fma) (symbol_ref TARGET_FMA)
-;; Disable generation of FMA4 instructions for generic code
-;; since FMA3 is preferred for targets that implement both
-;; instruction sets.
+;; Fma instruction selection has to be done based on
+;; register pressure. For generating fma4, a cost model
+;; based on register pressure is required. Till then,
+;; fma4 instruction is disabled for targets that implement
+;; both fma and fma4 instruction sets.
 (eq_attr isa fma4)
   (symbol_ref TARGET_FMA4  !TARGET_FMA)
]
Index: gcc/config/i386/driver-i386.c
===
--- gcc/config/i386/driver-i386.c   (revision 190830)
+++ gcc/config/i386/driver-i386.c   (working copy)
@@ -483,8 +483,6 @@
   has_abm = ecx  bit_ABM;
   has_lwp = ecx  bit_LWP;
   has_fma4 = ecx  bit_FMA4;
-  if (vendor == SIG_AMD  has_fma4  has_fma)
-   has_fma4 = 0;
   has_xop = ecx  bit_XOP;
   has_tbm = ecx  bit_TBM;
   has_lzcnt = ecx  bit_LZCNT;
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 190830)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3164,7 +3164,7 @@
   {bdver2, PROCESSOR_BDVER2, CPU_BDVER2,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
+   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
| PTA_FMA},
   {btver1, PROCESSOR_BTVER1, CPU_GENERIC64,

Regards
Ganesh



[PATCH,i386] fma4 addition for bdver2

2012-09-05 Thread Gopalasubramanian, Ganesh
Hello,

FMA4 and FMA3 ISA are implemented in bdver2 target.
FMA3 is selected by default. 
This patch supports the use of FMA4 intrinsics for bdver2 targets.

Is it OK for trunk?

Regards
Ganesh

2012-09-05  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

* config/i386/i386.md : Comments on fma4 instruction 
selection reflect requirement on register pressure based
cost model.
   
* config/i386/driver-i386.c (host_detect_local_cpu): fma4
flag is set-reset as informed by the cpuid flag.

* config/i386/i386.c (processor_alias_table): fma4
flag is enabled for bdver2.

Index: gcc/config/i386/i386.md
===
--- gcc/config/i386/i386.md (revision 190830)
+++ gcc/config/i386/i386.md (working copy)
@@ -659,9 +659,11 @@
 (eq_attr isa noavx2) (symbol_ref !TARGET_AVX2)
 (eq_attr isa bmi2) (symbol_ref TARGET_BMI2)
 (eq_attr isa fma) (symbol_ref TARGET_FMA)
-;; Disable generation of FMA4 instructions for generic code
-;; since FMA3 is preferred for targets that implement both
-;; instruction sets.
+;; Fma instruction selection has to be done based on
+;; register pressure. For generating fma4, a cost model
+;; based on register pressure is required. Till then,
+;; fma4 instruction is disabled for targets that implement
+;; both fma and fma4 instruction sets.
 (eq_attr isa fma4)
   (symbol_ref TARGET_FMA4  !TARGET_FMA)
]
Index: gcc/config/i386/driver-i386.c
===
--- gcc/config/i386/driver-i386.c   (revision 190830)
+++ gcc/config/i386/driver-i386.c   (working copy)
@@ -483,8 +483,6 @@
   has_abm = ecx  bit_ABM;
   has_lwp = ecx  bit_LWP;
   has_fma4 = ecx  bit_FMA4;
-  if (vendor == SIG_AMD  has_fma4  has_fma)
-   has_fma4 = 0;
   has_xop = ecx  bit_XOP;
   has_tbm = ecx  bit_TBM;
   has_lzcnt = ecx  bit_LZCNT;
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 190830)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3164,7 +3164,7 @@
   {bdver2, PROCESSOR_BDVER2, CPU_BDVER2,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
+   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
| PTA_FMA},
   {btver1, PROCESSOR_BTVER1, CPU_GENERIC64,

Regards
Ganesh



RE: [PATCH,i386] fma,fma4 and xop flags

2012-08-16 Thread Gopalasubramanian, Ganesh
 This won't work, since we have to prefer FMA3 also in case when only -mfma 
 -mfma4 without -mtune=XX is used. 
 We can add TARGET_FMA_BOTH though, but I doubt there will ever be target that 
 implements both insn sets without preferences.

Preferring FMA3 over FMA4 might not do good always. For instance, with 
increased register pressure FMA3 can be used. 
But, when we have more registers at our disposal, fma4 if used might do good by 
avoiding extra reload.
IMO, when preference of FMA instructions is adjudged by register pressure,  we 
may need some functionality to support that.

So, ideally for bdver2, we like to have both fma and fma4 getting generated 
with options -mfma -mfma4.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Tuesday, August 14, 2012 9:12 PM
To: Richard Henderson
Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] fma,fma4 and xop flags

On Mon, Aug 13, 2012 at 9:50 PM, Richard Henderson r...@redhat.com wrote:
 On 08/13/2012 12:33 PM, Uros Bizjak wrote:
 AFAIU fma3 is better than fma4 for bdver2 (the only CPU that 
 implements both FMA sets). Current description of bdver2 doesn't even 
 enable fma4 in processor_alias_table due to this fact.

 The change you are referring to adds preference for fma3 insn set for 
 generic code (not FMA4 builtins!), even when fma4 is enabled. So, no 
 matter which combination and sequence of -mfmfa -mfma4 or -mxop user 
 passes to the compiler, only fma3 instructions will be generated.

 This rationale needs to appear as a comment above

 +  (eq_attr isa fma4)
 +(symbol_ref TARGET_FMA4  !TARGET_FMA)

I plan to commit following patch:

--cut here--
Index: i386.md
===
--- i386.md (revision 190362)
+++ i386.md (working copy)
@@ -659,6 +659,9 @@
 (eq_attr isa noavx2) (symbol_ref !TARGET_AVX2)
 (eq_attr isa bmi2) (symbol_ref TARGET_BMI2)
 (eq_attr isa fma) (symbol_ref TARGET_FMA)
+;; Disable generation of FMA4 instructions for generic code
+;; since FMA3 is preferred for targets that implement both
+;; instruction sets.
 (eq_attr isa fma4)
   (symbol_ref TARGET_FMA4  !TARGET_FMA)
]
--cut here--

 Longer term we may well require some sort of

   (TARGET_FMA4  !(TARGET_FMA  TARGET_PREFER_FMA3))

 with an appropriate entry in ix86_tune_features to match.

This won't work, since we have to prefer FMA3 also in case when only -mfma 
-mfma4 without -mtune=XX is used. We can add TARGET_FMA_BOTH though, but I 
doubt there will ever be target that implements both insn sets without 
preferences.

Uros.




[PATCH,i386] cpuid function for prefetchw

2012-08-13 Thread Gopalasubramanian, Ganesh
Hello,

To get the prefetchw cpuid flag, cpuid
function 0x8001 needs to be called.
Previous to patch, function 0x7 is called.

Bootstrapping and make -k check passes without failures.
Ok for trunk?

Regards
Ganesh

2012-08-13 Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

PR driver/54210
* config/i386/driver-i386.c (host_detect_local_cpu): Call
cpuid function 0x8001 to get the prfchw cpuid flag. 

Index: gcc/config/i386/driver-i386.c
===
--- gcc/config/i386/driver-i386.c   (revision 189996)
+++ gcc/config/i386/driver-i386.c   (working copy)
@@ -467,7 +467,6 @@
   has_bmi2 = ebx  bit_BMI2;
   has_fsgsbase = ebx  bit_FSGSBASE;
   has_rdseed = ebx  bit_RDSEED;
-  has_prfchw = ecx  bit_PRFCHW;
 }

   /* Check cpuid level of extended features.  */ @@ -491,6 +490,7 @@
   has_longmode = edx  bit_LM;
   has_3dnowp = edx  bit_3DNOWP;
   has_3dnow = edx  bit_3DNOW;
+  has_prfchw = ecx  bit_PRFCHW;
 }

   if (!arch



RE: [PATCH,i386] cpuid function for prefetchw

2012-08-13 Thread Gopalasubramanian, Ganesh
Yes! Thanks Jakub.

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Monday, August 13, 2012 3:16 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] cpuid function for prefetchw

On Mon, Aug 13, 2012 at 09:29:45AM +, Gopalasubramanian, Ganesh wrote:
 To get the prefetchw cpuid flag, cpuid function 0x8001 needs to be 
 called.
 Previous to patch, function 0x7 is called.
 
 Bootstrapping and make -k check passes without failures.
 Ok for trunk?

IMHO you move it to a wrong spot, ecx bits of CPUID 0x8001 are tested 
earlier.

So I think you want this instead (bootstrap/regtest in progress):

2012-08-13  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com
Jakub Jelinek  ja...@redhat.com

PR driver/54210
* config/i386/driver-i386.c (host_detect_local_cpu): Test bit_PRFCHW
bit of CPUID 0x8001 %ecx instead of CPUID 7 %ecx.
* config/i386/cpuid.h (bits_PRFCHW): Move definition to CPUID
0x8001 %ecx flags.

--- gcc/config/i386/driver-i386.c.jj2012-08-10 15:49:25.0 +0200
+++ gcc/config/i386/driver-i386.c   2012-08-13 11:30:14.570494736 +0200
@@ -467,7 +467,6 @@ const char *host_detect_local_cpu (int a
   has_bmi2 = ebx  bit_BMI2;
   has_fsgsbase = ebx  bit_FSGSBASE;
   has_rdseed = ebx  bit_RDSEED;
-  has_prfchw = ecx  bit_PRFCHW;
   has_adx = ebx  bit_ADX;
 }
 
@@ -488,6 +487,7 @@ const char *host_detect_local_cpu (int a
   has_xop = ecx  bit_XOP;
   has_tbm = ecx  bit_TBM;
   has_lzcnt = ecx  bit_LZCNT;
+  has_prfchw = ecx  bit_PRFCHW;
 
   has_longmode = edx  bit_LM;
   has_3dnowp = edx  bit_3DNOWP;
--- gcc/config/i386/cpuid.h.jj  2012-08-10 15:49:25.0 +0200
+++ gcc/config/i386/cpuid.h 2012-08-13 11:31:30.346494092 +0200
@@ -52,6 +52,7 @@
 #define bit_LAHF_LM(1  0)
 #define bit_ABM(1  5)
 #define bit_SSE4a  (1  6)
+#define bit_PRFCHW (1  8)
 #define bit_XOP (1  11)
 #define bit_LWP(1  15)
 #define bit_FMA4(1  16)
@@ -69,7 +70,6 @@
 #define bit_HLE(1  4)
 #define bit_AVX2   (1  5)
 #define bit_BMI2   (1  8)
-#define bit_PRFCHW (1  8)
 #define bit_RTM(1  11)
 #define bit_RDSEED (1  18)
 #define bit_ADX(1  19)


Jakub




RE: [PATCH,i386] fma,fma4 and xop flags

2012-08-12 Thread Gopalasubramanian, Ganesh
Thank you Uros, Richard!

I will confirm the test results in couple off days.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Saturday, August 11, 2012 3:54 AM
To: Richard Henderson
Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] fma,fma4 and xop flags

On Fri, Aug 10, 2012 at 10:02 PM, Richard Henderson r...@redhat.com wrote:
 On 2012-08-10 12:59, Uros Bizjak wrote:
 Actually, this is the problem you are trying to solve. The fma4 
 patterns are defined before fma3, so gcc prefers these.

 The Real Problem is that they should not be separate patterns.
 They should be a single pattern that selects alternatives via the 
 enabled isa.

2012-08-11  Uros Bizjak  ubiz...@gmail.com

* config/i386/i386.md (isa): Add fma and fma4.
(enabled): Handle fma and fma4.
* config/i386/sse.md (*fma_fmadd_mode): Merge *fma4_fmadd_mode.
(*fma_fmsub_mode): Merge *fma4_fmsub_mode.
(*fma_fnmadd_mode): Merge *fma4_fnmadd_mode.
(*fma_fnmsub_mode): Merge *fma4_fnmsub_mode.
(*fma_fmaddsub_mode): Merge *fma4_fmaddsub_mode.
(*fma_fmsubadd_mode): Merge *fma4_fmsubadd_mode.

Tested on x86_64-pc-linux-gnu {,-m32}, committed to mainline SVN.

I will wait a couple of days before backporting patches to 4.7, so please 
Ganesh, test mainline if everything is OK.

BTW: With this patch, we can enable PTA_FMA4 for bdver2 target.

Uros.



RE: [PATCH,i386] fma,fma4 and xop flags

2012-08-09 Thread Gopalasubramanian, Ganesh
 -mxop implies -mfma4, but reverse is not true.

I think this handling went in for bdver1.
But, with bdver2, we have both fma and fma4.
So for bdver2, -mxop should not be enabling one of them.

 if someone set -mfma4 together
 with -mfma on the command line, we should NOT disable selected ISA
 behind user's back

If both -mfma4 and -mfma are enabled, GCC outputs fma4 instructions.
This, I think is because fma4 instruction patterns are read before fma 
instruction patterns from the .md files.
So, enabling both -mfma4 and -mfma is not good for bdver2.

Moreover, if user tries to use, -mfma -mno-fma4 -mxop, the order in which these 
options are used becomes crucial. -mxop enables -mfma4 and by instruction 
patterns fma4 instructions gets listed in the assembly file.

For the below test,

double a,b,c,d;
int fn(){
a = b + c * d ;
return a;
}

#1) Using options -O2 -mno-fma4 -mfma -mxop outputs fma4. 
(vfmaddsdb(%rip), %xmm2, %xmm1, %xmm0)
#2) Using options -O2 -mfma -mno-fma4 -mxop outputs fma4.
(vfmaddsdb(%rip), %xmm2, %xmm1, %xmm0)
#3) Using options -mxop -mno-fma4 -mfma outpts fma.
(vfmadd132sd d(%rip), %xmm1, %xmm0)

As we see the order in which the options are used becomes crucial.
This is confusing.

I haven't really tested other implied options. But, I suspect
similar phenomenon in those cases too. 

IMO, we can directly go by the CPUID flags and enable the flags.
This will be a one to one mapping and leave the user with lot more liberty.
Please let me know your opinion.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Friday, August 10, 2012 1:21 AM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] fma,fma4 and xop flags

On Wed, Aug 8, 2012 at 1:31 PM,  ganesh.gopalasubraman...@amd.com wrote:

 Bdver2 cpu supports both fma and fma4 instructions.
 Previous to patch, option -mno-xop removes -mfma4.
 Similarly, option -mno-fma4 removes -mxop.

It looks to me that there is some misunderstanding. AFAICS:

-mxop implies -mfma4, but reverse is not true. Please see

#define OPTION_MASK_ISA_FMA4_SET \
  (OPTION_MASK_ISA_FMA4 | OPTION_MASK_ISA_SSE4A_SET \
   | OPTION_MASK_ISA_AVX_SET)
#define OPTION_MASK_ISA_XOP_SET \
  (OPTION_MASK_ISA_XOP | OPTION_MASK_ISA_FMA4_SET)

So, -mxop sets -mfma4, etc ..., but -mfma4 does NOT enable -mxop.

OTOH,

#define OPTION_MASK_ISA_FMA4_UNSET \
  (OPTION_MASK_ISA_FMA4 | OPTION_MASK_ISA_XOP_UNSET)
#define OPTION_MASK_ISA_XOP_UNSET OPTION_MASK_ISA_XOP

-mno-fma4 implies -mno-xop, but again reverse is not true. Thus,
-mno-xop does NOT imply -mno-fma4.

 So, the patch conditionally disables -mfma or -mfma4.
 Enabling -mxop is done by also checking -mfma.

Please note that conditional handling of ISA flags belongs to
ix86_option_override_internal. However, if someone set -mfma4 together
with -mfma on the command line, we should NOT disable selected ISA
behind user's back, in the same way as we don't disable anything with
-march=i386 -msse4. With -march=bdver2, we already marked that only
fma is supported, and if user selected -march=bdver2 -mfma4 on the
command line, we shouldn't disable anything.

Uros.




RE: [PATCH,i386] fma,fma4 and xop flags

2012-08-08 Thread Gopalasubramanian, Ganesh
 Otherwise, what does -mno-fma4 -mxop do?
 (it should enable both xop and fma4!)  what should -mfma4 -mno-xop do
 (it should disable both xop and fma4!).

Yes! that's what GCC does now.
Some flags are coupled (atleast for now).
For ex, -mno-sse4.2 -mavx enables both sse4.2 and avx 
whereas -mavx -mno-sse4.2 disables both.

Setting of the following are clubbed.
1) 3DNow sets MMX
2) SSE2 sets SSE
3) SSE3 sets SSE2
4) SSE4_1 sets SSE3
5) SSE4_2 sets SSE4_1
6) FMA sets AVX
7) AVX2 sets AVX
8) SSE4_A sets SSE3
9) FMA4 set SSE4_A and AVX
10) XOP sets FMA4
11) AES sets SSE2
12) PCLMUL sets SSE2
13) ABM sets POPCNT

Resetting is done in reversely (MMX resets 3DNOW).

IMO, if we have different cpuid flags, enabling\disabling 
the compiler flags depends on these cpuid flags directly. 
Adding subsets to them or tangling them together may give 
wrong results.

Please let me know your opinion.

Regards
Ganesh

-Original Message-
From: Richard Guenther [mailto:richard.guent...@gmail.com] 
Sent: Wednesday, August 08, 2012 5:12 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; ubiz...@gmail.com
Subject: Re: [PATCH,i386] fma,fma4 and xop flags

On Wed, Aug 8, 2012 at 1:31 PM,  ganesh.gopalasubraman...@amd.com wrote:
 Hello,

 Bdver2 cpu supports both fma and fma4 instructions.
 Previous to patch, option -mno-xop removes -mfma4.
 Similarly, option -mno-fma4 removes -mxop.

Eh?  Why's that?  I think we should disentangle -mxop and -mfma4
instead.  Otherwise, what does -mno-fma4 -mxop do?
(it should enable both xop and fma4!)  what should -mfma4 -mno-xop do
(it should disable both xop and fma4!).  All this is just confusing to
the user, even if in AMD documents XOP includes FMA4.

Richard.



Backport: fma3 instruction generation for 'march=native' in AMD processors

2012-05-09 Thread Gopalasubramanian, Ganesh
Hello,

Below is the patch that has been committed in trunk (Revision: 187075).
We like to backport it to GCC 4.7 branch as couple of AMD processors require 
this change for fma3 instruction generation.

Bootstrapping and testing are successful. 

Is it OK to commit in GCC 4.7 branch?

Regards
Ganesh

PATCH
=

  * config/i386/driver-i386.c (host_detect_local_cpu): Reset
  has_fma4 for AMD processors with both fma3 and fma4 support.

Index: config/i386/driver-i386.c
===
--- config/i386/driver-i386.c   (revision 186897)
+++ config/i386/driver-i386.c   (working copy)
@@ -472,6 +472,8 @@
   has_abm = ecx  bit_ABM;
   has_lwp = ecx  bit_LWP;
   has_fma4 = ecx  bit_FMA4;
+  if (vendor == SIG_AMD  has_fma4  has_fma)
+   has_fma4 = 0;
   has_xop = ecx  bit_XOP;
   has_tbm = ecx  bit_TBM;
   has_lzcnt = ecx  bit_LZCNT;



Re: [PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors

2012-05-03 Thread Gopalasubramanian, Ganesh
I have added the ChangeLog and modified the patch.

Is it OK to commit to trunk?

Regards
Ganesh

2012-05-03  Ganesh Gopalasubramanian  ganesh.gopalasubraman...@amd.com

* config/i386/driver-i386.c (host_detect_local_cpu): Reset
has_fma4 for AMD processors with both fma3 and fma4 support.

Index: config/i386/driver-i386.c
===
--- config/i386/driver-i386.c   (revision 186897)
+++ config/i386/driver-i386.c   (working copy)
@@ -472,6 +472,8 @@
   has_abm = ecx  bit_ABM;
   has_lwp = ecx  bit_LWP;
   has_fma4 = ecx  bit_FMA4;
+  if (vendor == SIG_AMD  has_fma4  has_fma)
+   has_fma4 = 0;
   has_xop = ecx  bit_XOP;
   has_tbm = ecx  bit_TBM;
   has_lzcnt = ecx  bit_LZCNT;

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Wednesday, May 02, 2012 5:11 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] [i386] fma3 instruction generation for 'march=native' in 
AMD processors

On Wed, May 02, 2012 at 11:12:33AM +, Gopalasubramanian, Ganesh wrote:
 For AMD architectures with both fma3 and fma4 instructions' support, GCC 
 generates fma4 by default. Instead, we like to generate fma3 instruction. 
 Below patch enables the fma3 instruction generation for -march=native.
 
 Ok for trunk?

You haven't provided ChangeLog entry.

 Index: gcc/config/i386/driver-i386.c
 ===
 --- gcc/config/i386/driver-i386.c   (revision 186897)
 +++ gcc/config/i386/driver-i386.c   (working copy)
 @@ -472,6 +472,10 @@
has_abm = ecx  bit_ABM;
has_lwp = ecx  bit_LWP;
has_fma4 = ecx  bit_FMA4;
 +  if (((vendor == SIG_AMD))  (has_fma4)  (has_fma))
 +{
 +has_fma4 = 0;
 +}

And the formatting of this is wrong, 4 unnecessary pairs of (),
one unnecessary {} pair, bad indentation of the has_fma4 = 0;
assignment (should use a tab).

Jakub




[PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors

2012-05-02 Thread Gopalasubramanian, Ganesh
For AMD architectures with both fma3 and fma4 instructions' support, GCC 
generates fma4 by default. Instead, we like to generate fma3 instruction. Below 
patch enables the fma3 instruction generation for -march=native.

Ok for trunk?

Index: gcc/config/i386/driver-i386.c
===
--- gcc/config/i386/driver-i386.c   (revision 186897)
+++ gcc/config/i386/driver-i386.c   (working copy)
@@ -472,6 +472,10 @@
   has_abm = ecx  bit_ABM;
   has_lwp = ecx  bit_LWP;
   has_fma4 = ecx  bit_FMA4;
+  if (((vendor == SIG_AMD))  (has_fma4)  (has_fma))
+{
+has_fma4 = 0;
+}
   has_xop = ecx  bit_XOP;
   has_tbm = ecx  bit_TBM;
   has_lzcnt = ecx  bit_LZCNT;

Regards
Ganesh