RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost
Note that before the fixes for PR64909 the epilogue/prologue loops had very large costs associated due to a bug in the cost model implementation. After the fix their cost is reasonable but the cost of the extra jumps is way under-accounted for due to the numbers for cond_taken_branch_cost and cond_not_taken_branch_cost. The proposes match mitigates that somewhat. Richard! The patch is good. We are done with our benchmarking and found no regressions. How did you arrive at the original cost model? The original cost model as you suspect is not based on architecture alone. Those are the numbers arrived at by analyzing benchmarks and the cost model bugs then. These initial numbers were copied for subsequent architectures too. Cost assignments saying scalar_stmt_cost = 6 and scalar load_cost = 4 doesn't make sense at all. We will have a look into it. Regards Ganesh -Original Message- From: Richard Biener [mailto:rguent...@suse.de] Sent: Wednesday, April 08, 2015 1:08 PM To: Gopalasubramanian, Ganesh Cc: Uros Bizjak; gcc-patches@gcc.gnu.org Subject: RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost On Wed, 8 Apr 2015, Gopalasubramanian, Ganesh wrote: I have added a person from AMD to comment on the decision. Otherwise, the patch looks OK, but please wait a couple of days for possible comments. Thank you Uros! I am checking the changes with few tests and benchmarking them. Please wait for a couple of days. Note that before the fixes for PR64909 the epilogue/prologue loops had very large costs associated due to a bug in the cost model implementation. After the fix their cost is reasonable but the cost of the extra jumps is way under-accounted for due to the numbers for cond_taken_branch_cost and cond_not_taken_branch_cost. The proposes match mitigates that somewhat. How did you arrive at the original cost model? Thanks, Richard. -- Richard Biener rguent...@suse.de SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Jennifer Guild, Dilip Upmanyu, Graham Norton HRB 21284 (AG Nuernberg)
RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost
I have added a person from AMD to comment on the decision. Otherwise, the patch looks OK, but please wait a couple of days for possible comments. Thank you Uros! I am checking the changes with few tests and benchmarking them. Please wait for a couple of days. -Ganesh
RE: [PATCH] Rename gimple_build_assign_with_ops to gimple_build_assign and swap the first two arguments of it
The following patch implements that. Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? Our aarch64 build also breaks as mentioned in https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00119.html Regards Ganesh
RE: [PATCH, aarch64] Add prefetch support
There's no point in the buffer or the sprintf. The text is short enough to repeat whole pattern in the array: Updated the patch for the above suggestions. Is it ok for upstream? Regards Ganesh prefetch.diff Description: prefetch.diff
RE: [PATCH, aarch64] Add prefetch support
Please ignore the previous patch sent. The attachment was wrong. There's no point in the buffer or the sprintf. The text is short enough to repeat whole pattern in the array: Updated the patch for the above suggestions. make -k check RUNTESTFLAGS=execute.exp compile.exp dg.exp passes. Is it ok for upstream? Regards Ganesh prefetch.diff Description: prefetch.diff
FW: [PATCH, aarch64] Add prefetch support
PING! I am worried if it goes in stage-1. -Original Message- From: Gopalasubramanian, Ganesh Sent: Thursday, October 30, 2014 2:24 PM To: gcc-patches@gcc.gnu.org Subject: [PATCH, aarch64] Add prefetch support Hi, Below is the patch that implements prefetching support. This patch has been already discussed on a) https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html b) https://gcc.gnu.org/ml/gcc-patches/2014-07/msg00612.html I have not added a test as there are ample tests in compile and execute suites. make -k check passes. Ok for trunk? Changelog: 2014-10-30 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/aarch64/aarch64.md (define_insn prefetch): New. * config/arm/types.md (define_attr type): Add prefetch. Regards Ganesh diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 74b554e..12a3f170 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -320,6 +320,38 @@ [(set_attr type no_insn)] ) + +(define_insn prefetch + [(prefetch (match_operand:DI 0 address_operand r) + (match_operand:QI 1 const_int_operand ) + (match_operand:QI 2 const_int_operand ))] + + * +{ + const char * pftype[2][10] + = { {\PLDL1STRM\, \PLDL3KEEP\, \PLDL2KEEP\, \PLDL1KEEP\}, + {\PSTL1STRM\, \PSTL3KEEP\, \PSTL2KEEP\, \PSTL1KEEP\}, + }; + + int locality = INTVAL (operands[2]); + char pattern[100]; + + gcc_assert (IN_RANGE (locality, 0, 3)); + + strcpy (pattern, \prfm\\t\); + strcat (pattern, (const char*)pftype[INTVAL(operands[1])][locality]); + strcat (pattern, \, %a0\); + + output_asm_insn (pattern, + operands); + + return \\; + +} + [(set_attr type prefetch)] +) + (define_insn trap [(trap_if (const_int 1) (const_int 8))] diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index c1151f5..8b4b7a1 100644 --- a/gcc/config/arm/types.md +++ b/gcc/config/arm/types.md @@ -118,6 +118,7 @@ ; mvn_shift_reg inverting move instruction, shifted operand by a register. ; no_insn an insn which does not represent an instruction in the ; final output, thus having no impact on scheduling. +; prefetch a prefetch instruction ; rbit reverse bits. ; rev reverse bytes. ; sdiv signed division. @@ -556,6 +557,7 @@ call,\ clz,\ no_insn,\ + prefetch,\ csel,\ crc,\
[PATCH, aarch64] Add prefetch support
Hi, Below is the patch that implements prefetching support. This patch has been already discussed on a) https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html b) https://gcc.gnu.org/ml/gcc-patches/2014-07/msg00612.html I have not added a test as there are ample tests in compile and execute suites. make -k check passes. Ok for trunk? Changelog: 2014-10-30 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/aarch64/aarch64.md (define_insn prefetch): New. * config/arm/types.md (define_attr type): Add prefetch. Regards Ganesh diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 74b554e..12a3f170 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -320,6 +320,38 @@ [(set_attr type no_insn)] ) + +(define_insn prefetch + [(prefetch (match_operand:DI 0 address_operand r) + (match_operand:QI 1 const_int_operand ) + (match_operand:QI 2 const_int_operand ))] + + * +{ + const char * pftype[2][10] + = { {\PLDL1STRM\, \PLDL3KEEP\, \PLDL2KEEP\, \PLDL1KEEP\}, + {\PSTL1STRM\, \PSTL3KEEP\, \PSTL2KEEP\, \PSTL1KEEP\}, + }; + + int locality = INTVAL (operands[2]); + char pattern[100]; + + gcc_assert (IN_RANGE (locality, 0, 3)); + + strcpy (pattern, \prfm\\t\); + strcat (pattern, (const char*)pftype[INTVAL(operands[1])][locality]); + strcat (pattern, \, %a0\); + + output_asm_insn (pattern, + operands); + + return \\; + +} + [(set_attr type prefetch)] +) + (define_insn trap [(trap_if (const_int 1) (const_int 8))] diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index c1151f5..8b4b7a1 100644 --- a/gcc/config/arm/types.md +++ b/gcc/config/arm/types.md @@ -118,6 +118,7 @@ ; mvn_shift_reg inverting move instruction, shifted operand by a register. ; no_insn an insn which does not represent an instruction in the ; final output, thus having no impact on scheduling. +; prefetch a prefetch instruction ; rbit reverse bits. ; rev reverse bytes. ; sdiv signed division. @@ -556,6 +557,7 @@ call,\ clz,\ no_insn,\ + prefetch,\ csel,\ crc,\
RE: RFA: another patch to fix PR61360
The r-x alternative results in vector decoding on amdfam10. This is AMD-speak for microcoded instructions, and AMD optimization manual strongly recommends avoiding them. I have CC'd Ganesh, maybe he can provide more relevant data on the performance impact. Thanks Uros! Yes, the AMD SWOG recommends precisely what Uros mentions. snip from SWOG for BD When moving data from a GPR to an XMM register, use separate store and load instructions to move the data first from the source register to a temporary location in memory and then from memory into the destination register /snip This is listed as an optimization too. This holds good for all amdfam10 and BD family processors. I have to dig through the performance numbers will try to get them. Regards Ganesh
[PATCH, i386] PR61360: Do not update enabled attribute during lra and reload passes
This patch fixes PR 61360. The attribute enabled should actually be used enable/disable alternative based on sub-targets. In this pattern, it gets used across passes too. However, modifying this attribute in LRA pass is not something it is meant for. This patch allows enabling/disabling the attribute when optimizing for size, but not during lra pass or reload pass. Bootstrap passes. OK for upstream? Regards Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 6d91da0..3775f6e 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,4 +1,10 @@ -2014-08-22 David Malcolm dmalc...@redhat.com +2014-08-22 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + + PR 61360 + * config/i386/i386.md (*floatSWI48:modeMODEF:mode2_sse): + Do not modify enabled attribute during LRA pass. + +014-08-22 David Malcolm dmalc...@redhat.com * cprop.c (struct occr): Strengthen field insn from rtx to rtx_insn *. diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 8e74eab..de2ecf0 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -4795,10 +4795,10 @@ /* ??? For sched1 we need constrain_operands to be able to select an alternative. Leave this enabled before RA. */ (symbol_ref TARGET_INTER_UNIT_CONVERSIONS - || optimize_function_for_size_p (cfun) - || !(reload_completed -|| reload_in_progress -|| lra_in_progress)) + || (optimize_function_for_size_p (cfun) +!(reload_completed +|| reload_in_progress +|| lra_in_progress))) ] (symbol_ref true))) ])
RE: [PATCH, i386] Remove use of vpmacsdql instruction from multiplication.
Hi Uros! +2014-06-10 Ganesh Gopalasubramanian +ganesh.gopalasubraman...@amd.com + + * config/i386/i386.c (ix86_expand_sse2_mulvxdi3): Issue +instructions vpmuludq and vpaddq instead of vpmacsdql for +handling 32-bit multiplication. OK for mainline and release branches. I would like to backport the above patch for 4.9. Is it OK? Regards Ganesh
RE: [PATCH, i386] Add RDRND and MOVBE for AMD bdver4
OK for mainline. Thanks Uros. Committed to revision 213572 I would like to backport to 4.9 branch too. Is it OK? - Ganesh
[PATCH, i386] Add RDRND and MOVBE for AMD bdver4
Below patch adds PTA_RDRND and PTA_MOVBE for bdver4. Bootstrap passes. Ok for upstream? Regards Ganesh Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 213568) +++ gcc/ChangeLog (working copy) @@ -24,6 +24,11 @@ 2014-08-04 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com +* gcc/config/i386/i386.c (ix86_option_override_internal): Add + PTA_RDRND and PTA_MOVBE for bdver4. + +2014-08-04 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + * config/i386/driver-i386.c (host_detect_local_cpu): Handle AMD's extended family information. Handle BTVER2 cpu with cpuid family value. Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 213568) +++ gcc/config/i386/i386.c (working copy) @@ -3267,12 +3267,13 @@ ix86_option_override_internal (bool main | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE}, {bdver4, PROCESSOR_BDVER4, CPU_BDVER4, -PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 -| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 -| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 + PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 + | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 + | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_BMI2 | PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR - | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE}, + | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE | PTA_RDRND + | PTA_MOVBE}, {btver1, PROCESSOR_BTVER1, CPU_GENERIC, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW
RE: [PATCH, i386] Handle extended family cpuid info for AMD
Uros! I would like to have a check for a family at the beginning, something like: if (name == signature_NSC_ebx) processor = PROCESSOR_GEODE; else if (family == 22) { if (has_movbe) I get your idea of having the family checked first and then differentiating with cpuid info later. But, this case is getting interesting because, BTVER1 and BTVER2 are two variants but don't really have same family numbers. BTVER1 is family 14h and BTVER2 is family 16h. I don't see near term plans for any additional cpus to either 14h or 16h. Given that fact, this particular check is applicable only for BTVER2. In that case, having else if (family == 22) if (has_movbe) processor = PROCESSOR_BTVER2; looks odd. Regards Ganesh
RE: [PATCH, i386] Handle extended family cpuid info for AMD
In this case, having only check for family ID should be enough. If BTVER1 and BTVER2 can be uniquely determined by their family IDs , IMO, this would be the most future-proof approach. Signature checks will override family id checks which will override cpuid checks. Thank you Uros! I have modified source only for BTVER2. The way BTVER1 is currently assigned to processor includes more than one family. So, I am leaving that unmoved. Bootstrap passes. Is it OK for trunk and backport to open branches. Regards -Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 706fedc..202bd99 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2014-08-01 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + + * config/i386/driver-i386.c (host_detect_local_cpu): Handle AMD's extended family + information. Handle BTVER2 cpu with cpuid family value. + 2014-07-31 James Greenhalgh james.greenha...@arm.com * config/aarch64/arm_neon.h (vpadd_suf8,16,32,64): Move to diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c index 1c6385f..0402c90 100644 --- a/gcc/config/i386/driver-i386.c +++ b/gcc/config/i386/driver-i386.c @@ -432,7 +432,8 @@ const char *host_detect_local_cpu (int argc, const char **argv) model = (eax 4) 0x0f; family = (eax 8) 0x0f; - if (vendor == signature_INTEL_ebx) + if ((vendor == signature_INTEL_ebx) || + (vendor == signature_AMD_ebx)) { unsigned int extended_model, extended_family; @@ -576,7 +577,7 @@ const char *host_detect_local_cpu (int argc, const char **argv) if (name == signature_NSC_ebx) processor = PROCESSOR_GEODE; - else if (has_movbe) + else if (family == 22) processor = PROCESSOR_BTVER2; else if (has_avx2) processor = PROCESSOR_BDVER4;
[PATCH, i386] Handle extended family cpuid info for AMD
Hi, The below patch handles the AMD's cpuid family information. With the information from cpuid, BTVER2 cpu for -march=native flag is handled. Bootstrap passes. Is it OK for trunk and branches? Regards Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 6223bd6..3f8bb2c 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2014-07-31 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + + * tree-sra.c (host_detect_local_cpu): Handle AMD's extended family + information. Handle BTVER2 cpu with cpuid family value. + 2014-07-30 Martin Jambor mjam...@suse.cz * tree-sra.c (sra_ipa_modify_assign): Change type of the first diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c index 1c6385f..21ae1f3 100644 --- a/gcc/config/i386/driver-i386.c +++ b/gcc/config/i386/driver-i386.c @@ -432,7 +432,8 @@ const char *host_detect_local_cpu (int argc, const char **argv) model = (eax 4) 0x0f; family = (eax 8) 0x0f; - if (vendor == signature_INTEL_ebx) + if ((vendor == signature_INTEL_ebx) || + (vendor == signature_AMD_ebx)) { unsigned int extended_model, extended_family; @@ -576,7 +577,7 @@ const char *host_detect_local_cpu (int argc, const char **argv) if (name == signature_NSC_ebx) processor = PROCESSOR_GEODE; - else if (has_movbe) + else if (has_movbe family == 22) processor = PROCESSOR_BTVER2; else if (has_avx2) processor = PROCESSOR_BDVER4;
RE: [PATCH, i386] Handle extended family cpuid info for AMD
But, looking to processor_alias_table in config/i386/i386.c, only PROCESSOR_BTVER2 defines PTA_MOVBE. According to this, the logic is already correct, so the patch is not needed. We are evaluating bdver4 cpu. Bdver4 also supports MOVBE. I will submit patch for bdver4 PTA after our evaluation. Ganesh.
RE: [PATCH, i386] Handle extended family cpuid info for AMD
Then just use: + else if (has_avx2) +processor = PROCESSOR_BDVER4; else if (has_movbe) processor = PROCESSOR_BTVER2; - else if (has_avx2) -processor = PROCESSOR_BDVER4; else if (has_xsaveopt) In that case, with earlier GCC versions where we don’t have bdver4 added, the fall back would be BTVER2, whereas a BD variant is more desirable. Ganesh
FW: [PATCH, aarch64] Add prefetch support
PING! -Original Message- From: Gopalasubramanian, Ganesh Sent: Sunday, July 06, 2014 2:12 AM To: gcc-patches@gcc.gnu.org Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com Subject: RE: [PATCH, aarch64] Add prefetch support PING! From: Gopalasubramanian, Ganesh Sent: Friday, July 04, 2014 5:57 AM To: gcc-patches@gcc.gnu.org Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com Subject: [PATCH, aarch64] Add prefetch support Hi, Attached is a patch that implements * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) This patch has been already discussed on https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html make -k check passes. Ok for trunk? Changelog 2014-07-04 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/aarch64/aarch64.md (define_insn *prefetch) (define_insn prefetch): New * config/aarch64/predicates.md (aarch64_prefetch_pimm) (aarch64_prefetch_unscaled): New. * config/arm/types.md (define_attr type): Add prefetch. Regards Ganesh
RE: [PATCH, aarch64] Add prefetch support
PING! From: Gopalasubramanian, Ganesh Sent: Friday, July 04, 2014 5:57 AM To: gcc-patches@gcc.gnu.org Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com Subject: [PATCH, aarch64] Add prefetch support Hi, Attached is a patch that implements * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) This patch has been already discussed on https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html make -k check passes. Ok for trunk? Changelog 2014-07-04 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/aarch64/aarch64.md (define_insn *prefetch) (define_insn prefetch): New * config/aarch64/predicates.md (aarch64_prefetch_pimm) (aarch64_prefetch_unscaled): New. * config/arm/types.md (define_attr type): Add prefetch. Regards Ganesh
[PATCH, aarch64] Add prefetch support
Hi, Attached is a patch that implements * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) This patch has been already discussed on https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html make -k check passes. Ok for trunk? Changelog 2014-07-04 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/aarch64/aarch64.md (define_insn *prefetch) (define_insn prefetch): New * config/aarch64/predicates.md (aarch64_prefetch_pimm) (aarch64_prefetch_unscaled): New. * config/arm/types.md (define_attr type): Add prefetch. Regards Ganesh prefetch.diff Description: prefetch.diff
[PATCH, i386] Remove use of vpmacsdql instruction from multiplication.
Hi, The below patch fixes the issue with 64-bit multiplication. The instruction vpmacsdql does signed 32-bit multiplication. For V2DImode, we require widened unsigned multiplication. So, replacing the vpmacsdql instruction with vpmuludq and vpaddq. This patch had been already discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52908 With required change in the test xop-imul64-vector.c, make check passes. Is it OK for upstream? Regards Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index d0a1253..c158612 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,9 @@ +2014-06-10 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + + * config/i386/i386.c (ix86_expand_sse2_mulvxdi3): Issue instructions +vpmuludq and vpaddq instead of vpmacsdql for handling 32-bit +multiplication. + 2014-06-07 Jan Hubicka hubi...@ucw.cz * cgraphunit.c (assemble_thunks_and_aliases): Expand thunks before diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 9105132..184d82d 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -45205,8 +45205,10 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) /* t4: ((B*E)+(A*F))32, ((D*G)+(C*H))32 */ emit_insn (gen_ashlv2di3 (t4, t3, GEN_INT (32))); - /* op0: (((B*E)+(A*F))32)+(B*F), (((D*G)+(C*H))32)+(D*H) */ - emit_insn (gen_xop_pmacsdql (op0, op1, op2, t4)); + /* Multiply lower parts and add all */ + t5 = gen_reg_rtx (V2DImode); + emit_insn (gen_vec_widen_umult_even_v4si (t5, gen_lowpart (V4SImode, op1), gen_lowpart (V4SImode, op2))); + op0 = expand_binop (mode, add_optab, t5, t4, op0, 1, OPTAB_DIRECT); } else { diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index a6913af..757d3e3 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-06-10 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + + * gcc.target/i386/xop-imul64-vector.c: Remove the check for + vpmacsdql instruction. + 2014-06-07 Eric Botcazou ebotca...@adacore.com * gnat.dg/opt38.adb: New test. diff --git a/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c b/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c index fbf605f..fc8c880 100644 --- a/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c +++ b/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c @@ -33,4 +33,3 @@ int main () /* { dg-final { scan-assembler vpmulld } } */ /* { dg-final { scan-assembler vphadddq } } */ -/* { dg-final { scan-assembler vpmacsdql } } */
RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.
Hi Philipp, These changes look good to me. We'll try them out on the benchmarks that caused us to add prefetching in the first place. If you are OK, I would like to get these changes upstreamed. -Ganesh -Original Message- From: Dr. Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com] Sent: Friday, February 28, 2014 2:58 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; pins...@gmail.com Subject: Re: [AArch64 05/14] Add AArch64 'prefetch'-pattern. Ganesh, On 28 Feb 2014, at 10:13 , Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: I also have attached a patch that implements the following. * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) These changes look good to me. We'll try them out on the benchmarks that caused us to add prefetching in the first place. Best, Philipp.
RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.
With the locality value received in the instruction pattern, I think it would be safe to handle them in prefetch instruction. This helps especially AArch64 has prefetch instructions that can handle this locality. +(define_insn prefetch + [(prefetch (match_operand:DI 0 address_operand r) +(match_operand:QI 1 const_int_operand n) +(match_operand:QI 2 const_int_operand n))] + + * +{ + int locality = INTVAL (operands[2]); + + gcc_assert (IN_RANGE (locality, 0, 3)); + + if (locality == 0) + /* non temporal locality */ + return (INTVAL(operands[1])) ? \prfm\\tPSTL1STRM, [%0, #0]\ : \prfm\\tPLDL1STRM, [%0, #0]\; + + /* temporal locality */ + return (INTVAL(operands[1])) ? \prfm\\tPSTL%2KEEP, [%0, #0]\ : \prfm\\tPLDL%2KEEP, [%0, #0]\; +} + [(set_attr type prefetch)] +) + I also have attached a patch that implements * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) Regards Ganesh -Original Message- From: Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com] Sent: Wednesday, February 19, 2014 2:40 AM To: gcc-patches@gcc.gnu.org Cc: philipp.toms...@theobroma-systems.com Subject: [AArch64 05/14] Add AArch64 'prefetch'-pattern. --- gcc/config/aarch64/aarch64.md | 17 + gcc/config/arm/types.md | 2 ++ 2 files changed, 19 insertions(+) diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 99a6ac8..b972a1b 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -293,6 +293,23 @@ [(set_attr type no_insn)] ) +(define_insn prefetch + [(prefetch (match_operand:DI 0 register_operand r) +(match_operand:QI 1 const_int_operand n) +(match_operand:QI 2 const_int_operand n))] + + * +{ + if (INTVAL(operands[2]) == 0) + /* no temporal locality */ + return (INTVAL(operands[1])) ? \prfm\\tPSTL1STRM, [%0, #0]\ : +\prfm\\tPLDL1STRM, [%0, #0]\; + + /* temporal locality */ + return (INTVAL(operands[1])) ? \prfm\\tPSTL1KEEP, [%0, #0]\ : +\prfm\\tPLDL1KEEP, [%0, #0]\; } + [(set_attr type prefetch)] +) + (define_insn trap [(trap_if (const_int 1) (const_int 8))] diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index cc39cd1..1d1280d 100644 --- a/gcc/config/arm/types.md +++ b/gcc/config/arm/types.md @@ -117,6 +117,7 @@ ; mvn_shift_reg inverting move instruction, shifted operand by a register. ; no_insnan insn which does not represent an instruction in the ;final output, thus having no impact on scheduling. +; prefetch a prefetch instruction ; rbit reverse bits. ; revreverse bytes. ; sdiv signed division. @@ -553,6 +554,7 @@ call,\ clz,\ no_insn,\ + prefetch,\ csel,\ crc,\ extend,\ -- 1.9.0 prefetchdiff.log Description: prefetchdiff.log
RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.
Avoided top-posting and resending. + /* temporal locality */ + return (INTVAL(operands[1])) ? \prfm\\tPSTL1KEEP, [%0, #0]\ : +\prfm\\tPLDL1KEEP, [%0, #0]\; } + [(set_attr type prefetch)] +) + With the locality value received in the instruction pattern, I think it would be safe to handle them in prefetch instruction. This helps especially AArch64 has prefetch instructions that can handle this locality. +(define_insn prefetch + [(prefetch (match_operand:DI 0 address_operand r) +(match_operand:QI 1 const_int_operand n) +(match_operand:QI 2 const_int_operand n))] + + * +{ + int locality = INTVAL (operands[2]); + + gcc_assert (IN_RANGE (locality, 0, 3)); + + if (locality == 0) + /* non temporal locality */ + return (INTVAL(operands[1])) ? \prfm\\tPSTL1STRM, [%0, #0]\ : \prfm\\tPLDL1STRM, [%0, #0]\; + + /* temporal locality */ + return (INTVAL(operands[1])) ? \prfm\\tPSTL%2KEEP, [%0, #0]\ : \prfm\\tPLDL%2KEEP, [%0, #0]\; +} + [(set_attr type prefetch)] +) + I also have attached a patch that implements the following. * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) Regards Ganesh prefetchdiff.log Description: prefetchdiff.log
FW: Non-temporal move
I could see storent pattern in x86 machine descriptions (in sse.md)., but internals doc don't mention it. Should we add a description about this in the internals doc? Regards Ganesh
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
I'm sorry I didn't notice previous conversation. Please install ASAP. Thanks Uros! Committed to revision 206210. - Ganesh
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Hi, (get_amd_cpu): Handle AMD_BOBCAT, AMD_JAGUAR, AMDFAM15H_BDVER2 and AMDFAM15H_BDVER3. As mentioned earlier, we would like to stick with BTVER1 and BTVER2 instead of using BOBCAT or JAGUAR. Attached patch does the changes. Regards Ganesh NameChange.patch Description: NameChange.patch
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Please provide updated ChangeLog. --- gcc/ChangeLog (revision 206106) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,14 @@ +2013-12-19 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + + * config/i386/i386.c: Include cfgloop.h. + (ix86_loop_memcount): New function. + (ix86_loop_unroll_adjust): New function. + (TARGET_LOOP_UNROLL_ADJUST): Define. + * config/i386/i386.h + (TARGET_ADJUST_UNROLL): Define. + * config/i386/x86-tune.def + (X86_TUNE_ADJUST_UNROLL): Define. + The function comment is missing. Maybe you should also describe magic number 32 here? Added the function comment. Otherwise, the patch looks OK. Thanks. Bootstrapping passes. Is it OK for upstream? BTW: Please avoid top-posting, see e.g. [1] for reasons... Sorry for the lapse. Will comply. Regards Ganesh unroll-adjust.patch Description: unroll-adjust.patch
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Sorry, I must have been looking at an older version, but as I said I already did enable it in the latest patch. (see http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01577.html ) Sorry for causing another revision but we would like to stick with btver1 and btver2 rather than BOBCAT or JAGUAR. Therefore the changes would be like Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 206065) +++ gcc/config/i386/i386.c (working copy) @@ -29965,9 +29965,14 @@ P_PROC_SSE4_2, P_POPCNT, P_AVX, +P_PROC_AVX, +P_FMA4, +P_XOP, +P_PROC_XOP, +P_FMA, +P_PROC_FMA, P_AVX2, -P_FMA, -P_PROC_FMA +P_PROC_AVX2 }; enum feature_priority priority = P_ZERO; @@ -29986,11 +29991,15 @@ {sse, P_SSE}, {sse2, P_SSE2}, {sse3, P_SSE3}, + {sse4a, P_SSE4_a}, {ssse3, P_SSSE3}, {sse4.1, P_SSE4_1}, {sse4.2, P_SSE4_2}, {popcnt, P_POPCNT}, {avx, P_AVX}, + {fma4, P_FMA4}, + {xop, P_XOP}, + {fma, P_FMA}, {avx2, P_AVX2} }; @@ -30044,25 +30053,49 @@ break; case PROCESSOR_COREI7_AVX: arg_str = corei7-avx; - priority = P_PROC_SSE4_2; + priority = P_PROC_AVX; break; +case PROCESSOR_HASWELL: + arg_str = core-avx2; + priority = P_PROC_AVX2; + break; case PROCESSOR_ATOM: arg_str = atom; priority = P_PROC_SSSE3; break; +case PROCESSOR_SLM: + arg_str = slm; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_AMDFAM10: arg_str = amdfam10h; priority = P_PROC_SSE4_a; break; +case PROCESSOR_BTVER1: + arg_str = btver1; + priority = P_PROC_SSE4_a; + break; +case PROCESSOR_BTVER2: + arg_str = btver2; + priority = P_PROC_AVX; + break; case PROCESSOR_BDVER1: arg_str = bdver1; - priority = P_PROC_FMA; + priority = P_PROC_XOP; break; case PROCESSOR_BDVER2: arg_str = bdver2; priority = P_PROC_FMA; break; - } +case PROCESSOR_BDVER3: + arg_str = bdver3; + priority = P_PROC_FMA; + break; +case PROCESSOR_BDVER4: + arg_str = bdver4; + priority = P_PROC_AVX2; + break; +} } cl_target_option_restore (global_options, cur_target); @@ -30922,9 +30955,13 @@ F_SSE2, F_SSE3, F_SSSE3, +F_SSE4_a, F_SSE4_1, F_SSE4_2, F_AVX, +F_FMA4, +F_XOP, +F_FMA, F_AVX2, F_MAX }; @@ -30943,6 +30980,10 @@ M_AMDFAM10H, M_AMDFAM15H, M_INTEL_SLM, +M_INTEL_COREI7_AVX, +M_INTEL_CORE_AVX2, +M_AMD_BTVER1, +M_AMD_BTVER2, M_CPU_SUBTYPE_START, M_INTEL_COREI7_NEHALEM, M_INTEL_COREI7_WESTMERE, @@ -30953,7 +30994,9 @@ M_AMDFAM15H_BDVER1, M_AMDFAM15H_BDVER2, M_AMDFAM15H_BDVER3, -M_AMDFAM15H_BDVER4 +M_AMDFAM15H_BDVER4, +M_INTEL_COREI7_IVYBRIDGE, +M_INTEL_CORE_HASWELL }; static struct _arch_names_table @@ -30971,11 +31014,17 @@ {corei7, M_INTEL_COREI7}, {nehalem, M_INTEL_COREI7_NEHALEM}, {westmere, M_INTEL_COREI7_WESTMERE}, + {corei7-avx, M_INTEL_COREI7_AVX}, {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {ivybridge, M_INTEL_COREI7_IVYBRIDGE}, + {core-avx2, M_INTEL_CORE_AVX2}, + {haswell, M_INTEL_CORE_HASWELL}, {amdfam10h, M_AMDFAM10H}, {barcelona, M_AMDFAM10H_BARCELONA}, {shanghai, M_AMDFAM10H_SHANGHAI}, {istanbul, M_AMDFAM10H_ISTANBUL}, + {btver1, M_AMD_BTVER1}, + {btver2, M_AMD_BTVER2}, {amdfam15h, M_AMDFAM15H}, {bdver1, M_AMDFAM15H_BDVER1}, {bdver2, M_AMDFAM15H_BDVER2}, @@ -30997,9 +31046,13 @@ {sse2, F_SSE2}, {sse3, F_SSE3}, {ssse3, F_SSSE3}, + {sse4a, F_SSE4_a}, {sse4.1, F_SSE4_1}, {sse4.2, F_SSE4_2}, {avx,F_AVX}, + {fma4, F_FMA4}, + {xop,F_XOP}, + {fma,F_FMA}, {avx2, F_AVX2} }; Index: libgcc/config/i386/cpuinfo.c === --- libgcc/config/i386/cpuinfo.c(revision 206065) +++ libgcc/config/i386/cpuinfo.c(working copy) @@ -62,6 +62,10 @@ AMDFAM10H, AMDFAM15H, INTEL_SLM, + INTEL_COREI7_AVX, + INTEL_CORE_AVX2, + AMD_BTVER1, + AMD_BTVER2, CPU_TYPE_MAX }; @@ -75,6 +79,10 @@ AMDFAM10H_ISTANBUL, AMDFAM15H_BDVER1, AMDFAM15H_BDVER2, + AMDFAM15H_BDVER3, + AMDFAM15H_BDVER4, +
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Ping! Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: Yes, I figured that was the original idea behind it, but the final family of the jaguar processors seems to have become 16h instead of 14h (bobcat) at some point. Yes. It is amdfam16h. I was supposed to pass on some comments on the patch. 1. Amdfam16h for Jaguar. 2. For Jaguar, the priority needs to be AVX (AVX got included into the Jaguar ISA). I have a doubt! What would be done if priority is set to F_FMA4 instead of F_XOP for bdver1? Regards Ganesh
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Yes, I changed that in the last patch, though I consider it momentarily problematic because you do not yet enable AVX with march=btver2 (AVX versions would currently be better than btver2 versions for a btver2 arch), but expect march=btver2 will be fixed soon. The processor_alias_table entry for btver2 in i386.c enables AVX. snip {btver2, PROCESSOR_BTVER2, CPU_BTVER2, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_BMI | PTA_F16C | PTA_MOVBE | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT}, /snip The assembly listing for a simple test (compiled with -march=btver2) also has -mavx enabled. So, can you please enable AVX for btver2? Regards Ganesh
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that an old term for what has become the Jaguar architecture? Yes, btver2 = jaguar. We have the name as per its family name (i.e, bobcat family) in GCC. Similarly we have the names bdver2 = piledriver, bdver3 = steamroller as per their family (bulldozer) name. Regards Ganesh -Original Message- From: Allan Sandfeld Jensen [mailto:carew...@gmail.com] Sent: Monday, December 16, 2013 12:25 AM To: Uros Bizjak Cc: gcc-patches@gcc.gnu.org Subject: Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning Hi again On Wednesday 11 December 2013, Uros Bizjak wrote: Hello! PR gcc/59422 This patch extends the supported targets for function multi versiong to also include Haswell, Silvermont, and the most recent AMD models. It also prioritizes AVX2 versions over AMD specific pre-AVX2 versions. Please add a ChangeLog entry and attach the complete patch. Please also state how you tested the patch, as outlined in the instructions [1]. [1] http://gcc.gnu.org/contribute.html Updated patch for better CPU model detection and added ChangeLog. The patch has been tested with the attached test.cpp. Verified that it doesn't build before the patch, and that it builds after, and verified it selects correct versions at runtime based on either CPU model or supported ISA (tested on 3 machines: SandyBridge, IvyBridge and Phenom II). Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that an old term for what has become the Jaguar architecture? `Allan
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Yes, I figured that was the original idea behind it, but the final family of the jaguar processors seems to have become 16h instead of 14h (bobcat) at some point. Yes. It is amdfam16h. I was supposed to pass on some comments on the patch. 1. Amdfam16h for Jaguar. 2. For Jaguar, the priority needs to be AVX (AVX got included into the Jaguar ISA). I have a doubt! What would be done if priority is set to F_FMA4 instead of F_XOP for bdver1? Regards Ganesh
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Hi Uros! Accommodated the changes that you mentioned. Completed the bootstrap testing too. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Wednesday, December 04, 2013 3:17 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Richard Guenther richard.guent...@gmail.com (richard.guent...@gmail.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Dec 4, 2013 at 9:39 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: Attached is the revised patch. The target independent part has been already approved and added. This revision of the patch adds a x86 tune definition and checks it while deciding the unroll factor. Accommodated the comments given by you except one. *x will never be null for active insns. Since every rtx in the insn is checked for memory references, the NULL_RTX check is required. Yes you are correct. for_each_rtx also passes NULL_RTX, I was distracted by There are no sub-expressions. comment. +if (NONDEBUG_INSN_P (insn) INSN_CODE (insn) != -1) Do you need to check for INSN_CODE here? IIRC, checking for NONDEBUG_INSN_P is enough. +for_each_rtx (insn, (rtx_function) ix86_loop_memcount, mem_count); +} + free (bbs); + + if (mem_count =32) +return 32/mem_count; Ouch... mem_count can be zero. Is there a reason to change this part from previous patch? Uros. unroll-adjust.patch Description: unroll-adjust.patch
[patch][wwwdocs] gcc 4.9 changes - AMD new cores
Hello, This patch adds details about new AMD cores that got enabled in GCC-4.9. OK for the wwwdocs? Regards Ganesh cvs diff: Diffing . Index: changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v retrieving revision 1.44 diff -r1.44 changes.html 404a405,407 liSupport for new AMD family 15h processors (Excavator core) is now available through the code-march=bdver4/code and code-mtune=bdver4/code options./li
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Hi Uros! Attached is the revised patch. The target independent part has been already approved and added. This revision of the patch adds a x86 tune definition and checks it while deciding the unroll factor. Accommodated the comments given by you except one. *x will never be null for active insns. Since every rtx in the insn is checked for memory references, the NULL_RTX check is required. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Friday, November 22, 2013 1:46 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Richard Guenther richard.guent...@gmail.com (richard.guent...@gmail.com); borntrae...@de.ibm.com; H.J. Lu (hjl.to...@gmail.com); Jakub Jelinek (ja...@redhat.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Nov 20, 2013 at 7:26 PM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: Steamroller processors contain a loop predictor and a loop buffer, which may make unrolling small loops less important. When unrolling small loops for steamroller, making the unrolled loop fit in the loop buffer should be a priority. This patch uses a heuristic approach (number of memory references) to decide the unrolling factor for small loops. This patch has some noise in SPEC 2006 results. Bootstrapping passes. I would like to know your comments before committing. Please split the patch to target-dependant and target-independant part, and get target-idependant part reviewed first. This part: + if (ix86_tune != PROCESSOR_BDVER3 ix86_tune != PROCESSOR_BDVER4) + { +return nunroll; + } is wrong. You should introduce tune variable (as H.J. suggested) and check that variable here. Target dependant tuning options should be in x86-tune.def, so everything regarding tuning can be found in one place. +if (INSN_P (insn) INSN_CODE (insn) != -1) +for_each_rtx (insn, (rtx_function) ix86_loop_memcount, mem_count); if (NONDEBUG_INSN_P (insn)) for_each_rtx (PATTERN(insn), ...); otherwise your heuristics will depend on -g compile option. + if ( (mem_count*nunroll) = 32) Extra parenthesis. +static int +ix86_loop_memcount (rtx *x, unsigned *mem_count) { + if (*x != NULL_RTX MEM_P (*x)) *x will never be null for active insns. Uros. unroll-adjust.patch Description: unroll-adjust.patch
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Ouch... mem_count can be zero. Is there a reason to change this part from previous patch? Oops! You're right. I will correct this. The idea is to count the memory references and decide on the unrolling factor. Previous patch does that in two steps I thought of doing that in a single step. (I think I missed my step here ;) ) Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Wednesday, December 04, 2013 3:17 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Richard Guenther richard.guent...@gmail.com (richard.guent...@gmail.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Dec 4, 2013 at 9:39 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: Attached is the revised patch. The target independent part has been already approved and added. This revision of the patch adds a x86 tune definition and checks it while deciding the unroll factor. Accommodated the comments given by you except one. *x will never be null for active insns. Since every rtx in the insn is checked for memory references, the NULL_RTX check is required. Yes you are correct. for_each_rtx also passes NULL_RTX, I was distracted by There are no sub-expressions. comment. +if (NONDEBUG_INSN_P (insn) INSN_CODE (insn) != -1) Do you need to check for INSN_CODE here? IIRC, checking for NONDEBUG_INSN_P is enough. +for_each_rtx (insn, (rtx_function) ix86_loop_memcount, mem_count); +} + free (bbs); + + if (mem_count =32) +return 32/mem_count; Ouch... mem_count can be zero. Is there a reason to change this part from previous patch? Uros.
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
This patch adds influence of macro TARGET_LOOP_UNROLL_ADJUST during constant iterations (decide_unroll_constant_iterations). The macro has been already checked for runtime iterations (decide_unroll_runtime_iterations), and for unroll stupid (decide_unroll_stupid). Bootstrapping and test passes. Would like to know your comments before committing. Regards Ganesh 2013-11-28 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * loop-unroll.c (decide_unroll_constant_iterations): Check macro TARGET_LOOP_UNROLL_ADJUST while deciding unroll factor. diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c index 9c87167..557915f 100644 --- a/gcc/loop-unroll.c +++ b/gcc/loop-unroll.c @@ -664,6 +664,9 @@ decide_unroll_constant_iterations (struct loop *loop, int flags) if (nunroll (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) nunroll = PARAM_VALUE (PARAM_MAX_UNROLL_TIMES); + if (targetm.loop_unroll_adjust) +nunroll = targetm.loop_unroll_adjust (nunroll, loop); + /* Skip big loops. */ if (nunroll = 1) { -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Friday, November 22, 2013 1:46 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Richard Guenther richard.guent...@gmail.com (richard.guent...@gmail.com); borntrae...@de.ibm.com; H.J. Lu (hjl.to...@gmail.com); Jakub Jelinek (ja...@redhat.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Nov 20, 2013 at 7:26 PM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: Steamroller processors contain a loop predictor and a loop buffer, which may make unrolling small loops less important. When unrolling small loops for steamroller, making the unrolled loop fit in the loop buffer should be a priority. This patch uses a heuristic approach (number of memory references) to decide the unrolling factor for small loops. This patch has some noise in SPEC 2006 results. Bootstrapping passes. I would like to know your comments before committing. Please split the patch to target-dependant and target-independant part, and get target-idependant part reviewed first.
RE: [PATCH, i386]: Fix PR56788, _mm_frcz_sd and _mm_frcz_ss ignore their second argument
Hopefully someone from AMD will provide tests that are mysteriously missing from XOP testsuite. As pointed out by Marc, I added myself to the bug later. I was bit confused about the internal insn representation with user-visible function. So, couldn't add test then and there. I could have solved that earlier. Sorry for that. Attached is the test that checks the (controversial) frcz functions. Uros could you please add this to your patch while committing. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Saturday, November 23, 2013 6:49 PM To: gcc-patches@gcc.gnu.org Cc: Cong Hou; Marc Glisse; Gopalasubramanian, Ganesh Subject: [PATCH, i386]: Fix PR56788, _mm_frcz_sd and _mm_frcz_ss ignore their second argument Hello! Attached patch fixes PR56788, where _mm_frcz_{ss,sd} intrinsics ignored their second argument. As explained in the PR [1], gcc implements two-operand vector-merge form as documented in Microsoft's definition [2]. However, in contrast to other SSE scalar insns, the instruction itself clears upper bits to zero. There were a couple of problems: the builtin was declared as builtin with two input operands, but the number of input operands didn't correspond to referred insn pattern, leaving its second operand uninitialized. The intrinsic was also implemented without necessary movss/movsd fixup that would merge both its operands in a correct way. Please also note that the definition in clang is wrong. I didn't include any testcase in the patch, since I don't have access to XOP target. Hopefully someone from AMD will provide tests that are mysteriously missing from XOP testsuite. 2013-11-23 Uros Bizjak ubiz...@gmail.com PR target/56788 * config/i386/i386.c (bdesc_multi_arg) IX86_BUILTIN_VFRCZSS: Declare as MULTI_ARG_1_SF instruction. IX86_BUILTIN_VFRCZSD: Decleare as MULTI_ARG_1_DF instruction. * config/i386/sse.md (*xop_vmfrczmode2): Rename from *xop_vmfrcz_mode. * config/i386/xopintrin.h (_mm_frcz_ss): Use __builtin_ia32_movss to merge scalar result with __A. (_mm_frcz_sd): Use __builtin_ia32_movsd to merge scalar result with __A. Patch was bootstrapped and regression tested on x86_64-pc-linux-gnu {,-m32}. The patch was committed to mainline SVN and will be committed to other release branches in a couple of days (hopefully with additional tests). [1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56788 [2] http://msdn.microsoft.com/en-us/library/vstudio/gg445126%28v=vs.100%29.aspx Uros. #include x86intrin.h #include m128-check.h void check_mm_vmfrcz_sd (__m128d __A, __m128d __B) { union128d a, b, c; double d[2]; a.x = __A; b.x = __B; c.x = _mm_frcz_sd (__A, __B); d[0] = b.a[0] - (int)b.a[0] ; d[1] = a.a[1]; if (check_union128d (c, d)) abort (); } void check_mm_vmfrcz_ss (__m128 __A, __m128 __B) { union128 a, b, c; float f[4]; a.x = __A; b.x = __B; c.x = _mm_frcz_ss (__A, __B); f[0] = b.a[0] - (int)b.a[0] ; f[1] = a.a[1]; f[2] = a.a[2]; f[3] = a.a[3]; if (check_union128 (c, f)) abort (); } void main (void) { union128 a, b; union128d c,d; int i; for (i = 0; i 4; i++) { a.a[i] = i + 3.5; b.a[i] = i + 7.9; } for (i = 0; i 2; i++) { c.a[i] = i + 3.5; d.a[i] = i + 7.987654321; } check_mm_vmfrcz_ss (a.x, b.x); check_mm_vmfrcz_sd (c.x, d.x); }
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Ping! -Original Message- From: Gopalasubramanian, Ganesh Sent: Thursday, November 21, 2013 10:35 AM To: 'H.J. Lu' Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther richard.guent...@gmail.com (richard.guent...@gmail.com); borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com) Subject: RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 I suggest you add this to x86-tune.def and enable it for bdver3 and bdver4. The macro TARGET_LOOP_UNROLL_ADJUST is not new. It is already available and is used by target s390. Since it is not an x86 only feature I didn't add that in x86-tune.def. Regards Ganesh -Original Message- From: H.J. Lu [mailto:hjl.to...@gmail.com] Sent: Thursday, November 21, 2013 12:02 AM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther richard.guent...@gmail.com (richard.guent...@gmail.com); borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Nov 20, 2013 at 10:26 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: Hi, Steamroller processors contain a loop predictor and a loop buffer, which may make unrolling small loops less important. When unrolling small loops for steamroller, making the unrolled loop fit in the loop buffer should be a priority. This patch uses a heuristic approach (number of memory references) to decide the unrolling factor for small loops. This patch has some noise in SPEC 2006 results. Bootstrapping passes. I would like to know your comments before committing. I suggest you add this to x86-tune.def and enable it for bdver3 and bdver4. -- H.J.
[RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Hi, Steamroller processors contain a loop predictor and a loop buffer, which may make unrolling small loops less important. When unrolling small loops for steamroller, making the unrolled loop fit in the loop buffer should be a priority. This patch uses a heuristic approach (number of memory references) to decide the unrolling factor for small loops. This patch has some noise in SPEC 2006 results. Bootstrapping passes. I would like to know your comments before committing. Regards Ganesh loop_unroll_bdver3.patch Description: loop_unroll_bdver3.patch
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
I suggest you add this to x86-tune.def and enable it for bdver3 and bdver4. The macro TARGET_LOOP_UNROLL_ADJUST is not new. It is already available and is used by target s390. Since it is not an x86 only feature I didn't add that in x86-tune.def. Regards Ganesh -Original Message- From: H.J. Lu [mailto:hjl.to...@gmail.com] Sent: Thursday, November 21, 2013 12:02 AM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther richard.guent...@gmail.com (richard.guent...@gmail.com); borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Nov 20, 2013 at 10:26 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: Hi, Steamroller processors contain a loop predictor and a loop buffer, which may make unrolling small loops less important. When unrolling small loops for steamroller, making the unrolled loop fit in the loop buffer should be a priority. This patch uses a heuristic approach (number of memory references) to decide the unrolling factor for small loops. This patch has some noise in SPEC 2006 results. Bootstrapping passes. I would like to know your comments before committing. I suggest you add this to x86-tune.def and enable it for bdver3 and bdver4. -- H.J.
RE: Honnor ix86_accumulate_outgoing_args again
we are going to have some AMD CPU with AVX2 support soon, the question is if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even if it will prefer split, the question is if like bdver{1,2,3} it will be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned loads/stores are handled is much less important there. Ganesh? 256-bit is friendly on bdver4. But, 256 bit unaligned stores are micro-coded which we would like to avoid. So we require 128-bit MOVUPS. -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Tuesday, November 12, 2013 3:57 PM To: Jan Hubicka Cc: H.J. Lu; Vladimir Makarov; GCC Patches; Uros Bizjak; Richard Henderson; Gopalasubramanian, Ganesh Subject: Re: Honnor ix86_accumulate_outgoing_args again On Tue, Nov 12, 2013 at 11:05:45AM +0100, Jan Hubicka wrote: @@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx op0, rtx op1) if (MEM_P (op1)) { - if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD) + if (!TARGET_AVX2 TARGET_AVX256_SPLIT_UNALIGNED_LOAD) { rtx r = gen_reg_rtx (mode); m = adjust_address (op1, mode, 0); @@ -16596,7 +16596,7 @@ ix86_avx256_split_vector_move_misalign (rtx op0, rtx op1) } else if (MEM_P (op0)) { - if (TARGET_AVX256_SPLIT_UNALIGNED_STORE) + if (!TARGET_AVX2 TARGET_AVX256_SPLIT_UNALIGNED_STORE) I would add explanation comment on those two. Looking at http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01235.html we are going to have some AMD CPU with AVX2 support soon, the question is if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even if it will prefer split, the question is if like bdver{1,2,3} it will be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned loads/stores are handled is much less important there. Ganesh? Shall we also disable argument accumulation for cores? It seems we won't solve the IRA issues, right? You mean LRA issues here, right? If you are starting to use no-accumulate-outgoing-args much more often than in the past, I think the problem that LRA forces a frame pointer in that case is much more important now (or has that been fixed in the mean time?). Vlad? Jakub
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Thanks Honza! I have committed changes ( for default ). http://gcc.gnu.org/viewcvs/gcc?view=revisionrevision=204442 I will add lookahead value 8 for O3 after experimenting with it. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Wednesday, October 30, 2013 1:54 AM To: Richard Biener Cc: Jan Hubicka; Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); H.J. Lu (hjl.to...@gmail.com) Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips On Fri, 25 Oct 2013, Jan Hubicka wrote: OK, so it is about 2%. Did you try if you need lookahead even in the early pass (before reload)? My guess would be so, but if not, it could cut the cost to half. For -Ofast/-O3 it looks resonable to me, but we will need to announce it on the ML. For other settings I think we need to work on more improvements or cut the expenses. Yes, it is required before reload. I have another idea which can be pondered upon. Currently, can we enable lookahead with the value 4 (pre reload) for default? This will exponentially cut the cost of build time. I have done some measurements on the build time of some benchmarks (mentioned below) with lookahead value 4. The 2% increase in build time with value 8 is now almost gone. dfa4 no_lookahead perlbench - 191s 193s bzip2 - 19s 19s gcc - 429s 429s mcf - 3s3s gobmk - 116s 115s hmmer - 60s 60s sjeng - 18s 17s libquantum - 6s6s h264ref - 107s 107s omnetpp - 128s 128s astar - 7s7s bwaves - 5s5s gamess - 1964s 1957s milc- 18s 18s GemsFDTD- 273s 272s Lookahead value 4 also helps because, the modified decoder model in bdver3.md is only two cycles deep (though in hardware it is actually 4 cycles deep). This means that we can look another two levels deep for better schedule. GemsFDTD still retains the performance boost of around 6-7% with value 4. Let me know your thoughts. This seems resonable. I would go for lookahead of 4 for now and 8 for -Ofast and we can tune things based on the experience with this setting incrementally. Uros, Richard, what do you think? Well, certainly -O3 not -Ofast. Yes, enabling 4 by default and 8 at -O3 seems fine to me. Honza Richard.
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
OK, so it is about 2%. Did you try if you need lookahead even in the early pass (before reload)? My guess would be so, but if not, it could cut the cost to half. For -Ofast/-O3 it looks resonable to me, but we will need to announce it on the ML. For other settings I think we need to work on more improvements or cut the expenses. Yes, it is required before reload. I have another idea which can be pondered upon. Currently, can we enable lookahead with the value 4 (pre reload) for default? This will exponentially cut the cost of build time. I have done some measurements on the build time of some benchmarks (mentioned below) with lookahead value 4. The 2% increase in build time with value 8 is now almost gone. dfa4 no_lookahead perlbench - 191s 193s bzip2 - 19s 19s gcc - 429s 429s mcf - 3s3s gobmk - 116s 115s hmmer - 60s 60s sjeng - 18s 17s libquantum - 6s6s h264ref - 107s 107s omnetpp - 128s 128s astar - 7s7s bwaves - 5s5s gamess - 1964s 1957s milc- 18s 18s GemsFDTD- 273s 272s Lookahead value 4 also helps because, the modified decoder model in bdver3.md is only two cycles deep (though in hardware it is actually 4 cycles deep). This means that we can look another two levels deep for better schedule. GemsFDTD still retains the performance boost of around 6-7% with value 4. Let me know your thoughts. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Thursday, October 24, 2013 6:48 PM To: Gopalasubramanian, Ganesh Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); H.J. Lu (hjl.to...@gmail.com) Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips Hi, Is this with -fschedule-insns? Or only with default settings? Did you test the compile time implications of increasing the lookahead? (value of 8 is very large, we may consider enbling it only for -Ofast, limiting for postreload only or something similar). The improvement is seen with the options -fschedule-insns -fschedule-insns2 -fsched-pressure Below are the build times of some of the SPEC benchmarks dfa8 no_lookahead perlbench - 196s 193s bzip2 - 19s 19s gcc - 439s 429s mcf - 3s3s gobmk - 119s 115s hmmer - 62s 60s sjeng - 18s 17s libquantum - 6s6s h264ref - 110s 107s omnetpp - 132s 128s astar - 7s7s bwaves - 4s5s gamess - 1996s 1957s milc- 18s 18s GemsFDTD- 276s 272s I think we can enable it by default rather than for -Ofast. Please let me know your inputs. OK, so it is about 2%. Did you try if you need lookahead even in the early pass (before reload)? My guess would be so, but if not, it could cut the cost to half. For -Ofast/-O3 it looks resonable to me, but we will need to announce it on the ML. For other settings I think we need to work on more improvmeents or cut the expenses. Honza Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Thursday, October 24, 2013 2:54 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); hubi...@ucw.cz; H.J. Lu (hjl.to...@gmail.com) Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips Attached is the patch which does the following scheduler related changes. * re-models bdver3 decoder. * It enables lookahead with value 8 for all BD architectures. The patch doesn't consider if reloading is completed or not (an area that needs to be worked on). * The issue rate for BD architectures are set to 4. I see the following performance improvements on bdver3 machine. * GemsFDTD improves by 6-7% with lookahead value changed to 8. * Hmmer improves by 9% when issue rate when set to 4 . Is this with -fschedule-insns? Or only with default settings? Did you test the compile time implications of increasing the lookahead? (value of 8 is very large, we may consider enbling it only for -Ofast, limiting for postreload only or something similar). I have considered the following hardware details for the model. * There are four decoders inside a hardware decoder block. * These four independent decoders can execute in parallel. (They can take 8B from four different instructions and decode). * These four decoders are pipelined 4 cycles deep and are non-stalling
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Attached is the patch which does the following scheduler related changes. * re-models bdver3 decoder. * It enables lookahead with value 8 for all BD architectures. The patch doesn't consider if reloading is completed or not (an area that needs to be worked on). * The issue rate for BD architectures are set to 4. I see the following performance improvements on bdver3 machine. * GemsFDTD improves by 6-7% with lookahead value changed to 8. * Hmmer improves by 9% when issue rate when set to 4 . I have considered the following hardware details for the model. * There are four decoders inside a hardware decoder block. * These four independent decoders can execute in parallel. (They can take 8B from four different instructions and decode). * These four decoders are pipelined 4 cycles deep and are non-stalling. * Each decoder takes 8B of instruction data every cycle and tries decoding it. * Issue rate is 4. Is it OK for upstream? Changelog 2013-10-24 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/bdver3.md : Added two additional decoder units to support issue rate of 4 and remodeled vector unit. * config/i386/i386.c (ix86_issue_rate): Issue rate for BD architectures is set to 4. * config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA lookahead is set to 8 for BD architectures. Regards Ganesh bdver3_issue_rate_lookahead.patch Description: bdver3_issue_rate_lookahead.patch
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Hi, Is this with -fschedule-insns? Or only with default settings? Did you test the compile time implications of increasing the lookahead? (value of 8 is very large, we may consider enbling it only for -Ofast, limiting for postreload only or something similar). The improvement is seen with the options -fschedule-insns -fschedule-insns2 -fsched-pressure Below are the build times of some of the SPEC benchmarks dfa8 no_lookahead perlbench - 196s 193s bzip2 - 19s 19s gcc - 439s 429s mcf - 3s3s gobmk - 119s 115s hmmer - 62s 60s sjeng - 18s 17s libquantum - 6s6s h264ref - 110s 107s omnetpp - 132s 128s astar - 7s7s bwaves - 4s5s gamess - 1996s 1957s milc- 18s 18s GemsFDTD- 276s 272s I think we can enable it by default rather than for -Ofast. Please let me know your inputs. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Thursday, October 24, 2013 2:54 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); hubi...@ucw.cz; H.J. Lu (hjl.to...@gmail.com) Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips Attached is the patch which does the following scheduler related changes. * re-models bdver3 decoder. * It enables lookahead with value 8 for all BD architectures. The patch doesn't consider if reloading is completed or not (an area that needs to be worked on). * The issue rate for BD architectures are set to 4. I see the following performance improvements on bdver3 machine. * GemsFDTD improves by 6-7% with lookahead value changed to 8. * Hmmer improves by 9% when issue rate when set to 4 . Is this with -fschedule-insns? Or only with default settings? Did you test the compile time implications of increasing the lookahead? (value of 8 is very large, we may consider enbling it only for -Ofast, limiting for postreload only or something similar). I have considered the following hardware details for the model. * There are four decoders inside a hardware decoder block. * These four independent decoders can execute in parallel. (They can take 8B from four different instructions and decode). * These four decoders are pipelined 4 cycles deep and are non-stalling. * Each decoder takes 8B of instruction data every cycle and tries decoding it. * Issue rate is 4. What is the overall limitation on number of bytes the instructions can occupy? I think they need to fit into 2 16 byte windows, right? In that case we may want to tweak the existing corei7 scheduling code to take care of this. Making scheduler not overly optimistic about the parallelism is good since it will make less register pressure during the first pass. Is it OK for upstream? Otherwise the patch seems OK, but I would like to know the compile time effect first. Honza Changelog 2013-10-24 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/bdver3.md : Added two additional decoder units to support issue rate of 4 and remodeled vector unit. * config/i386/i386.c (ix86_issue_rate): Issue rate for BD architectures is set to 4. * config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA lookahead is set to 8 for BD architectures. Regards Ganesh
[PATCH,i386] Enable FMA4 for AMD bdver3
Hi The below patch enables FMA4 for AMD bdver3 architectures. make -k check passes. Is it OK for upstream? Regards Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index fb5b267..cbb5311 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-10-16 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + + * config/i386/i386.c (ix86_option_override_internal): Enable FMA4 + for AMD bdver3. + 2013-10-16 Hans-Peter Nilsson h...@axis.com * config/cris/t-elfmulti (MULTILIB_OPTIONS, MULTILIB_DIRNAMES) diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index b5796db..c24ce36 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -3104,7 +3104,7 @@ ix86_option_override_internal (bool main_args_p, {bdver3, PROCESSOR_BDVER3, CPU_BDVER3, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 - | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX + | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE},
RE: [PATCH,i386] Enable FMA4 for AMD bdver3
4.8.2 is already rolling, so too late for that. Is 4.8 branch (gcc/branches/gcc-4_8-branch) open? If yes, shall I commit these changes? Regards Ganesh -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Wednesday, October 16, 2013 12:41 PM To: Uros Bizjak Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Enable FMA4 for AMD bdver3 On Wed, Oct 16, 2013 at 09:00:58AM +0200, Uros Bizjak wrote: On Wed, Oct 16, 2013 at 8:28 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: The below patch enables FMA4 for AMD bdver3 architectures. make -k check passes. +2013-10-16 Ganesh Gopalasubramanian +ganesh.gopalasubraman...@amd.com + + * config/i386/i386.c (ix86_option_override_internal): Enable FMA4 + for AMD bdver3. + OK for mainline and 4.8 branch (for 4.8.2 if approved by Jakub, otherwise please wait for branch to open). 4.8.2 is already rolling, so too late for that. Jakub
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Hi Honza! I will give it a try. I understand how the proposed patch would be. That is OK to me. OK, my undertanding always was, that the decoders works sort of sequentially. One decoder of bdver1 can do vector, double, single, single, single, signle where decoder 1 is somehow special but hardware is able to swap first and second single. Now if two decoders run, i would expect vector, vector single, single, signle double, single, double, single to be decoded in once cycle My understanding on the decode unit is mentioned below. Please correct me if I am wrong. The sequential allotment of decoders is not there for bdver1. Intel Sandybridge\core2 have four decoders. The first decoder is special for intel processors. For ex, In Sandybridge, the instructions that have only one µop can be decoded by any of the four decoders. Instructions that have up to four µops will be decoded by the first decoder (of the four decoders available) only. Bdver1 has four decoders. None of them is special unlike intel processors. For bdver1, microcoded instructions are single issue. All four decoders are engaged for decoding microcoded instructions. Decode unit of bdver3 has following specifications * Four independent decoders which can execute in parallel * Microcoded instructions are single issue. (All four decoders are engaged). This means that only one vectorpath instruction get issued in one cycle. * The additional hardware instruction decoder increases the instruction decode capacity to eight instructions per clock cycle. * The decoders are pipelined 4 cycles deep and are non-stalling. So modeling vectorpath instructions is straightforward. No instructions are issued along with vector instructions. We need to model only fastpath single and fastpath double instructions. There are four decoders and they can execute in parallel. So they can take either two double or four single instructions. We also don't need to model them in two stage as there is no sequence involved. So, the modeling can be done such that in one cycle we schedule 2 singles + 1 double (or) 4 singles (or) 2 doubles. I have tried to model this for bdver3 (code changes are mentioned below). Please let me know your opinion. Regards Ganesh Patch - diff --git a/gcc/config/i386/bdver3.md b/gcc/config/i386/bdver3.md index 52418b5..9e59395 100644 --- a/gcc/config/i386/bdver3.md +++ b/gcc/config/i386/bdver3.md @@ -34,6 +34,8 @@ (define_cpu_unit bdver3-decode0 bdver3) (define_cpu_unit bdver3-decode1 bdver3) +(define_cpu_unit bdver3-decode2 bdver3) +(define_cpu_unit bdver3-decode3 bdver3) (define_cpu_unit bdver3-decodev bdver3) ;; Double decoded instructions take two cycles whereas @@ -42,12 +44,15 @@ ;; two decoders in two cycles. ;; Vectorpath instructions are single issue instructions. ;; So, we have separate unit for vector instructions. -(exclusion_set bdver3-decodev bdver3-decode0,bdver3-decode1) +(exclusion_set bdver3-decodev bdver3-decode0,bdver3-decode1,bdver3-decode2,bdver3-decode3) (define_reservation bdver3-vector bdver3-decodev) -(define_reservation bdver3-direct (bdver3-decode0|bdver3-decode1)) +(define_reservation bdver3-direct (bdver3-decode0|bdver3-decode1|bdver3-decoder2|bdver3-decoder3)) -(define_reservation bdver3-double (bdver3-decode0|bdver3-decode1)*2) +(define_reservation bdver3-double (bdver3-decode0+bdver3-decode1)| + (bdver3-decode1+bdver3-decode2)|(bdver3-decode2+bdver3-decode3)| + (bdver3-decode0+bdver3-decode2)|(bdver3-decode1+bdver3-decode3)| + (bdver3-decode0+bdver3-decode3)) -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Wednesday, October 09, 2013 7:18 PM To: Gopalasubramanian, Ganesh Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips Before merging the insn reservations, I need to compare the latency values for bdver1 and bdver3. I know that they are different for some of the instructions. In that case, the merging should prop up another subset of latency differences. I would like to keep these insn reservations in two .md files (one for bdver1 and one for bdver3) even after the merger. I am not really insisting on merging (define_insn_reservation bdver3*) with (define_insn_reservation bdver1*). What I have in mind is merging actual atuomatons in cases it makes sense. Latencies are not really encoded in those. Bdver 12 has: (define_automaton bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu) while bdver 3: (define_automaton bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu) automatons bdver1 and bdver3 are very different, because one handles up to 3 instructions, while other handles only 2. I am still bit confused with this every second cycle logic, so lets discuss it incrementally. I would propose to have (define_automaton bdver3) or perhaps (define_automaton bdver3,bdver3_fp) now
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Hi Honza, Yep, I think we need to merge only those autmatas tha are same for both: (define_automaton bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu) probably can become (define_automaton bdver3,bdver3_fp) with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu changed to bdver1 automaton. I think it should result in smaller binary - the fact that all conditionals are physically duplicated in bdver1/bdev3.md should be optimized away by genautomata. Before merging the insn reservations, I need to compare the latency values for bdver1 and bdver3. I know that they are different for some of the instructions. In that case, the merging should prop up another subset of latency differences. I would like to keep these insn reservations in two .md files (one for bdver1 and one for bdver3) even after the merger. Your version has problem that it does not model the thing that the two decoders works sequentially. The two stage modeling is required so that the decode unit reservations are screened from other unit reservations. But this sort of goes away in bdver3 because of the decode cycle. In bdver3, the decode units scan two of these windows every two cycles decoding a maximum of eight instructions. The hardware scan is done every two cycles in bdver3 whereas it is done every single cycle in bdver1/bdver2. (But we have two separate hardware decoders which guarantees higher throughput) This means that the two stage modeling is not required in the scheduler descriptions since the hardware sort of guarantees that with its scanning mechanism. Our job is to make sure that 8 direct instructions get scheduled in two cycles or 4 double instructions get scheduled in two cycles. So, I have modeled the bdver3 decoders such that with in a cycle they guarantee to issue 4 direct instructions or 2 double instructions. This eliminates the sequencing problem in modeling decoders and also ensures that the issue rate can be numbered for a single cycle rather than two cycles. This is one of the reasons why I remodeled only bdver3. Let me know your comments on this. We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more realistic estimates on what still can be issued - the value of 6 is unrealistically high. This would get more complicated if we go by decoder capacity in bdver3. As we have two hardware decoders in steamroller (bdver3), they have a capacity to decode eight instructions per clock cycle, providing up to twice the decode and dispatch bandwidth compared to bdver1. If we model this in GCC we need to change the issue rate to 8. If 6 is high, then 8 would add more joy and excitement. TARGET_SCHED_VARIABLE_ISSUE is a nice suggestion to schedule instructions in different way. We also should enable ia32_multipass_dfa_lookahead - with that scheduler should be able to put double decoded and vector decoded insns on the proper places. Yes. Whenever we have this scheduler analysis in place we discuss about this but unfortunately is left as it is. I will look into this after I do the enablement for bdver4. I will work on replacing most of the CPU cases into tuning flags + costs. I am planning to get bdver4 enablement in place once scheduler descriptions for bdver3 is done with. I will have cycles to look into the cost models. Please delegate some tasks if you can and I am willing to take them up. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Tuesday, October 08, 2013 3:20 PM To: Gopalasubramanian, Ganesh Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips Hi Honza, I am planning to update the scheduler descriptions for bdver3 first. Attached is the patch. Please let me know your comments if any. Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and decoding schemes are different. So, let me know how can I approach merging these. Yep, I think we need to merge only those autmatas tha are same for both: (define_automaton bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu) probably can become (define_automaton bdver3,bdver3_fp) with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu changed to bdver1 automaton. I think it should result in smaller binary - the fact that all conditionals are physically duplicated in bdver1/bdev3.md should be optimized away by genautomata. I also played a bit with the decoders and I am attaching my version - that seems SPEC neutral though. Your version has problem that it does not model the thing that the two decoders works sequentially. I removed the bdver1-decodev unit and instead i simply reserve all thre decoders + I added presence set requring second decoder to be taken only after first one changed presence set requring decoder 2 to be taken only after decoder 1+2 to final presence set, so
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Hi Honza, I am planning to update the scheduler descriptions for bdver3 first. Attached is the patch. Please let me know your comments if any. Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and decoding schemes are different. So, let me know how can I approach merging these. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Monday, September 30, 2013 4:47 PM To: gcc-patches@gcc.gnu.org; Gopalasubramanian, Ganesh; hjl.to...@gmail.com Subject: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips Hi, while looking into schedules produced for Buldozer and Core I noticed that they do not seem to match reality. This is because ix86_issue_rate limits those CPUs into 3 instructions per cycle, while they are designed to do 4 and somewhat confused ix86_adjust_cost. I also added stack engine into modern chips even though scheduler doesn't really understand that multiple push operations can happen in one cycle. At least it gets the stack updates in sequences of push/pop operations. I did not updated buldozer issue rates yet. The current scheduler model won't allow it to execute more than 3 instructions per cycle (and 2 for version 3). I think bdver1.md/bdver3.md needs to be updated first. I am testing x86_64-linux and will commit if there are no complains. Honza * i386.c (ix86_issue_rate): Pentium4/Nocona issue 2 instructions per cycle, Core/CoreI7/Haswell 4 instructions per cycle. (ix86_adjust_cost): Add stack engine to modern AMD chips; fix for core; remove Atom that mistakely shared code with AMD. Index: config/i386/i386.c === --- config/i386/i386.c (revision 203011) +++ config/i386/i386.c (working copy) @@ -24435,17 +24435,14 @@ ix86_issue_rate (void) case PROCESSOR_SLM: case PROCESSOR_K6: case PROCESSOR_BTVER2: +case PROCESSOR_PENTIUM4: +case PROCESSOR_NOCONA: return 2; case PROCESSOR_PENTIUMPRO: -case PROCESSOR_PENTIUM4: -case PROCESSOR_CORE2: -case PROCESSOR_COREI7: -case PROCESSOR_HASWELL: case PROCESSOR_ATHLON: case PROCESSOR_K8: case PROCESSOR_AMDFAM10: -case PROCESSOR_NOCONA: case PROCESSOR_GENERIC: case PROCESSOR_BDVER1: case PROCESSOR_BDVER2: @@ -24453,6 +24450,11 @@ ix86_issue_rate (void) case PROCESSOR_BTVER1: return 3; +case PROCESSOR_CORE2: +case PROCESSOR_COREI7: +case PROCESSOR_HASWELL: + return 4; + default: return 1; } @@ -24709,10 +24711,15 @@ ix86_adjust_cost (rtx insn, rtx link, rt case PROCESSOR_BDVER3: case PROCESSOR_BTVER1: case PROCESSOR_BTVER2: -case PROCESSOR_ATOM: case PROCESSOR_GENERIC: memory = get_attr_memory (insn); + /* Stack engine allows to execute pushpop instructions in parall. */ + if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP) + (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP)) + (ix86_tune != PROCESSOR_ATHLON ix86_tune != PROCESSOR_K8)) + return 0; + /* Show ability of reorder buffer to hide latency of load by executing in parallel with previous instruction in case previous instruction is not needed to compute the address. */ @@ -24737,6 +24744,29 @@ ix86_adjust_cost (rtx insn, rtx link, rt else cost = 0; } + break; + +case PROCESSOR_CORE2: +case PROCESSOR_COREI7: +case PROCESSOR_HASWELL: + memory = get_attr_memory (insn); + + /* Stack engine allows to execute pushpop instructions in parall. */ + if ((insn_type == TYPE_PUSH || insn_type == TYPE_POP) + (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP)) + return 0; + + /* Show ability of reorder buffer to hide latency of load by executing +in parallel with previous instruction in case +previous instruction is not needed to compute the address. */ + if ((memory == MEMORY_LOAD || memory == MEMORY_BOTH) + !ix86_agi_dependent (dep_insn, insn)) + { + if (cost = 4) + cost -= 4; + else + cost = 0; + } break; case PROCESSOR_SLM: issue_rate_bdver3.patch Description: issue_rate_bdver3.patch
RE: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
1. For cmp/test with rip-relative addressing mem operand, don't group insns. Bulldozer also doesn't support fusion for cmp/test with both displacement MEM and immediate operand, while m_CORE_ALL doesn't support fusion for cmp/test with MEM and immediate operand. I simplify choose to use the more stringent constraint here (m_CORE_ALL's constraint). This suits Bulldozer's specification. We don't see an issue with the proposed patch. Regards Ganesh -Original Message- From: H.J. Lu [mailto:hjl.to...@gmail.com] Sent: Wednesday, September 25, 2013 2:12 AM To: Wei Mi Cc: Jan Hubicka; Alexander Monakov; Steven Bosscher; GCC Patches; David Li; Kirill Yukhin Subject: Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion On Tue, Sep 24, 2013 at 12:06 PM, Wei Mi w...@google.com wrote: This is the updated patch2. Changed: 1. For cmp/test with rip-relative addressing mem operand, don't group insns. Bulldozer also doesn't support fusion for cmp/test with both displacement MEM and immediate operand, while m_CORE_ALL doesn't support fusion for cmp/test with MEM and immediate operand. I simplify choose to use the more stringent constraint here (m_CORE_ALL's constraint). 2. Add Budozer back and merge TARGET_FUSE_CMP_AND_BRANCH_64 and TARGET_FUSE_CMP_AND_BRANCH_32. bootstrap and regression pass. ok for trunk? 2013-09-24 Wei Mi w...@google.com * gcc/config/i386/i386.c (rip_relative_addr_p): New Function. (ix86_macro_fusion_p): Ditto. (ix86_macro_fusion_pair_p): Ditto. * gcc/config/i386/i386.h: Add new tune features about macro-fusion. * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto. * gcc/doc/tm.texi: Generated. * gcc/doc/tm.texi.in: Ditto. * gcc/haifa-sched.c (try_group_insn): New Function. (group_insns_for_macro_fusion): Ditto. (sched_init): Call group_insns_for_macro_fusion. * gcc/sched-rgn.c (add_branch_dependences): Keep insns in a SCHED_GROUP at the end of BB to remain their location. * gcc/target.def: Add two hooks: macro_fusion_p and macro_fusion_pair_p. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 1fd3f60..4a04778 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -24856,6 +24856,167 @@ ia32_multipass_dfa_lookahead (void) } } +/* Extracted from ix86_print_operand_address. Check whether ADDR is a + rip-relative address. */ + +static bool +rip_relative_addr_p (rtx addr) +{ + struct ix86_address parts; + rtx base, index, disp; + int ok; + + if (GET_CODE (addr) == UNSPEC XINT (addr, 1) == UNSPEC_VSIBADDR) +{ + ok = ix86_decompose_address (XVECEXP (addr, 0, 0), parts); + parts.index = XVECEXP (addr, 0, 1); +} + else if (GET_CODE (addr) == UNSPEC XINT (addr, 1) == UNSPEC_LEA_ADDR) +ok = ix86_decompose_address (XVECEXP (addr, 0, 0), parts); else +ok = ix86_decompose_address (addr, parts); + + gcc_assert (ok); + base = parts.base; + index = parts.index; + disp = parts.disp; + + if (TARGET_64BIT !base !index) +{ + rtx symbol = disp; + + if (GET_CODE (disp) == CONST + GET_CODE (XEXP (disp, 0)) == PLUS + CONST_INT_P (XEXP (XEXP (disp, 0), 1))) + symbol = XEXP (XEXP (disp, 0), 0); + + if (GET_CODE (symbol) == LABEL_REF + || (GET_CODE (symbol) == SYMBOL_REF + SYMBOL_REF_TLS_MODEL (symbol) == 0)) + return true; +} + if (flag_pic !base !index) +{ + if (GET_CODE (disp) == CONST + GET_CODE (XEXP (disp, 0)) == UNSPEC + (XINT (XEXP (disp, 0), 1) == UNSPEC_PCREL + || XINT (XEXP (disp, 0), 1) == UNSPEC_GOTPCREL + || (TARGET_64BIT + XINT (XEXP (disp, 0), 1) == UNSPEC_GOTNTPOFF))) + return true; +} + return false; +} + It doesn't look right. IP relative address is only possible with TARGET_64BIT and 1. base == pc. Or 2. UUNSPEC_PCREL, UNSPEC_GOTPCREL, and NSPEC_GOTNTPOFF. -- H.J.
RE: [PATCH,i386] Default alignment for AMD BD and BT
Thanks Jakub! Committed revision 201402. -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Thursday, July 04, 2013 4:46 PM To: Gopalasubramanian, Ganesh Cc: Uros Bizjak (ubiz...@gmail.com); gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT On Thu, Jul 04, 2013 at 11:14:24AM +, Gopalasubramanian, Ganesh wrote: Can this be backported now! Yes. Jakub
RE: [PATCH,i386] Default alignment for AMD BD and BT
Hi Uros, Can this be backported now! Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Thursday, May 30, 2013 1:40 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT On Wed, May 29, 2013 at 1:28 PM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: We want this to be backported to GCC48 branch. Please approve. -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Tuesday, May 07, 2013 6:22 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: The patch updates the alignment values for AMD BD and BT architectures. make -k check passes. Is it OK for upstream? 2013-05-07 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/i386.c (processor_target_table): Modified default alignment values for AMD BD and BT architectures. This is OK, but please wait until 4.8 branch is open again. Thanks, Uros.
RE: [PATCH,i386] FP Reassociation for AMD bdver1 and bdver2
Thanks Uros! Committed at r199405. -Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Thursday, May 23, 2013 4:47 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] FP Reassociation for AMD bdver1 and bdver2 On Thu, May 23, 2013 at 1:11 PM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: The patch enables FP Reassociation pass AMD bdver1 and bdver2 architectures. We note a performance uplift of around ~8% on calculix. make -k check passes. Is it OK for upstream? OK. Thanks, Uros.
RE: [PATCH,i386] Default alignment for AMD BD and BT
Hi We want this to be backported to GCC48 branch. Please approve. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Tuesday, May 07, 2013 6:22 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: The patch updates the alignment values for AMD BD and BT architectures. make -k check passes. Is it OK for upstream? 2013-05-07 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/i386.c (processor_target_table): Modified default alignment values for AMD BD and BT architectures. The value 11 indeed looks a bit weird, but it means: align to 16 byte boundary only if this can be done by skipping 10 bytes or less. Tha patch is OK for mainline. Thanks, Uros.
RE: [PATCH, i386]: Update processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags
Thank you Uros for the patch. Could you backport this to the 4.8.0? -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Wednesday, May 15, 2013 11:16 PM To: gcc-patches@gcc.gnu.org Cc: Gopalasubramanian, Ganesh Subject: [PATCH, i386]: Update processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags Hello! Attached patch adds missing PTA_PRFCHW and PTA_FXSR flags to x86 processor alias table. PRFCHW CPUID flag is shared with 3dnow prefetch flag, so some additional logic is needed to avoid generating SSE prefetches for non-SSE 3dNow! targets, while still generating full set of 3dnow prefetches on 3dNow! targets. 2013-05-15 Uros Bizjak ubiz...@gmail.com * config/i386/i386.c (iy86_option_override_internal): Update processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags. Add PTA_POPCNT to corei7 entry and remove PTA_SSE from athlon-4 entry. Do not enable SSE prefetch on non-SSE 3dNow! targets. Enable TARGET_PRFCHW for TARGET_3DNOW targets. * config/i386/i386.md (prefetch): Enable for TARGET_PRFCHW instead of TARGET_3DNOW. (*prefetch_3dnow): Enable for TARGET_PRFCHW only. Patch was bootstrapped and regression tested on x86_64-pc-linux-gnu {,-m32} and was committed to mainline SVN. The patch will be backported to 4.8 branch in a couple of days. Uros.
RE: [PATCH,i386] FSGSBASE for AMD bdver3
Thank you Uros! Patch for FSGSBASE instruction generation for AMD bdver3 committed to trunk (rr198916). Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Monday, May 13, 2013 5:50 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] FSGSBASE for AMD bdver3 On Mon, May 13, 2013 at 1:54 PM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: The patch enables FSGSBASE instruction generation for AMD bdver3 architectures. make -k check passes. Is it OK for upstream? OK. Please also check for missing PTA_PRFCHW and PTA_FXSR for AMD processors in processor_alias_table. Thanks, Uros.
RE: [PATCH,i386] FSGSBASE for AMD bdver3
Thanks Uros! I think you mean the amdfam10 ISA mismatch between march=native and march=amdfam10. The below patch fills the gap. make -k check passes. Regards Ganesh 2013-05-07 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/i386.c (processor_alias_table): Mismatch in ISAs Between march=native and march=amdfam10 is fixed. --- ./wkcpy/gcc-4.9.0/gcc/config/i386/i386.c 2013-02-21 16:27:10.0 +0530 +++ ./source/gcc-4.9.0/gcc/config/i386/i386.c 2013-10-21 22:20:28.0 +0530 @@ -2964,7 +2964,8 @@ | PTA_SSE2 | PTA_NO_SAHF}, {amdfam10, PROCESSOR_AMDFAM10, CPU_AMDFAM10, PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE - | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM}, + | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM + | PTA_FXSR | PTA_PRFCHW}, {barcelona, PROCESSOR_AMDFAM10, CPU_AMDFAM10, PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM}, -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Monday, May 13, 2013 5:50 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] FSGSBASE for AMD bdver3 On Mon, May 13, 2013 at 1:54 PM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: The patch enables FSGSBASE instruction generation for AMD bdver3 architectures. make -k check passes. Is it OK for upstream? OK. Please also check for missing PTA_PRFCHW and PTA_FXSR for AMD processors in processor_alias_table. Thanks, Uros.
RE: [PATCH,i386] Default alignment for AMD BD and BT
Thank you Uros! Committed r198820. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Tuesday, May 07, 2013 6:22 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: The patch updates the alignment values for AMD BD and BT architectures. make -k check passes. Is it OK for upstream? 2013-05-07 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/i386.c (processor_target_table): Modified default alignment values for AMD BD and BT architectures. The value 11 indeed looks a bit weird, but it means: align to 16 byte boundary only if this can be done by skipping 10 bytes or less. Tha patch is OK for mainline. Thanks, Uros.
[PATCH,i386] FSGSBASE for AMD bdver3
Hi The patch enables FSGSBASE instruction generation for AMD bdver3 architectures. make -k check passes. Is it OK for upstream? Regards Ganesh Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 198821) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2013-05-13 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + +* config/i386/i386.c (processor_alias_table): Add instruction +FSGSBASE for AMD bdver3 architecture. + 2013-05-13 Martin Jambor mjam...@suse.cz PR middle-end/42371 Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 198821) +++ gcc/config/i386/i386.c (working copy) @@ -3000,7 +3000,7 @@ | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE - | PTA_XSAVEOPT}, + | PTA_XSAVEOPT | PTA_FSGSBASE}, {btver1, PROCESSOR_BTVER1, CPU_GENERIC64, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW
[PATCH,i386] Default alignment for AMD BD and BT
Hi The patch updates the alignment values for AMD BD and BT architectures. make -k check passes. Is it OK for upstream? Regards Ganesh 2013-05-07 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/i386.c (processor_target_table): Modified default alignment values for AMD BD and BT architectures. Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 198386) +++ gcc/config/i386/i386.c (working copy) @@ -2450,11 +2450,11 @@ {generic32_cost, 16, 7, 16, 7, 16}, {generic64_cost, 16, 10, 16, 10, 16}, {amdfam10_cost, 32, 24, 32, 7, 32}, - {bdver1_cost, 32, 24, 32, 7, 32}, - {bdver2_cost, 32, 24, 32, 7, 32}, - {bdver3_cost, 32, 24, 32, 7, 32}, - {btver1_cost, 32, 24, 32, 7, 32}, - {btver2_cost, 32, 24, 32, 7, 32}, + {bdver1_cost, 16, 10, 16, 7, 11}, + {bdver2_cost, 16, 10, 16, 7, 11}, + {bdver3_cost, 16, 10, 16, 7, 11}, + {btver1_cost, 16, 10, 16, 7, 11}, + {btver2_cost, 16, 10, 16, 7, 11}, {atom_cost, 16, 15, 16, 7, 16} };
RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores
Thank you Gerald! Committed with the changes. Regards Ganesh -Original Message- From: Gerald Pfeifer [mailto:ger...@pfeifer.com] Sent: Thursday, February 14, 2013 2:40 PM To: Gopalasubramanian, Ganesh Cc: gcc-patchesUros Bizjak Subject: RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores On Thu, 14 Feb 2013, Gopalasubramanian, Ganesh wrote: Is it OK for wwdocs? Looks good to me if you say ...through the... options (adding the in two cases) and breaking the lines to not exceed 76 columns. Thanks, Gerald
[patch][wwwdocs] gcc 4.8 changes - AMD new cores
Hello, This patch adds short words about the new AMD cores that got enabled in GCC-4.8. OK for the wwwdocs? Regards Ganesh Index: gcc-4.8/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v retrieving revision 1.96 diff -u -r1.96 changes.html --- gcc-4.8/changes.html12 Feb 2013 16:33:58 - 1.96 +++ gcc-4.8/changes.html13 Feb 2013 08:24:53 - @@ -529,6 +529,10 @@ information. /li li Windows MinGW-w64 targets (code*-w64-mingw*/code) require at least r5437 from the Mingw-w64 trunk. /li +liSupport for new AMD family 15h processors (Steamroller core) is now available + through code-march=bdver3/code and code-mtune=bdver3/code options./li +liSupport for new AMD family 16h processors (Jaguar core) is now available + through code-march=btver2/code and code-mtune=btver2/code options./li /ul h3 id=frvFRV/h3
RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores
Is it OK for wwdocs? Index: gcc-4.8/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v retrieving revision 1.96 diff -u -r1.96 changes.html --- gcc-4.8/changes.html12 Feb 2013 16:33:58 - 1.96 +++ gcc-4.8/changes.html13 Feb 2013 08:24:53 - @@ -529,6 +529,10 @@ information. /li li Windows MinGW-w64 targets (code*-w64-mingw*/code) require at least r5437 from the Mingw-w64 trunk. /li +liSupport for new AMD family 15h processors (Steamroller core) is now available + through code-march=bdver3/code and code-mtune=bdver3/code options./li +liSupport for new AMD family 16h processors (Jaguar core) is now available + through code-march=btver2/code and code-mtune=btver2/code options./li /ul h3 id=frvFRV/h3 -Original Message- From: Mikael Morin [mailto:mikael.mo...@sfr.fr] Sent: Wednesday, February 13, 2013 6:38 PM To: Richard Biener Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org; ubizjak at gmail dot com (gcc-bugzi...@gcc.gnu.org); ger...@pfeifer.com Subject: Re: [patch][wwwdocs] gcc 4.8 changes - AMD new cores Le 13/02/2013 14:00, Richard Biener a écrit : Of course not. Next they'll add blver ... Sorry
RE: [PATCH, i386]: AMD bdver3 enablement
Thank Uros for the comments. The changes are committed to trunk http://gcc.gnu.org/viewcvs?view=revisionrevision=193548 http://gcc.gnu.org/viewcvs?view=revisionrevision=193549 Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Wednesday, November 14, 2012 4:15 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH, i386]: AMD bdver3 enablement On Wed, Nov 14, 2012 at 10:22 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: sseshuf replaces sselog in some insn patterns, but should be handled in the same way in *existing* .md files. Modifications done as per the comments. 1. Sseshuf is added along with sselog in existing md files. 2. sseshuf is handled in a separate pattern in bdver3.md Bootstrapping and make -k check passes. Ok for trunk? 2012-11-14 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com bdver3 Enablement * gcc/doc/extend.texi: Add details about bdver3. * gcc/doc/invoke.texi: Add details about bdver3. * config.gcc (i[34567]86-*-linux* | ...): Add bdver3. (case ${target}): Add bdver3. * config/i386/i386.h (TARGET_BDVER3): New definition. * config/i386/i386.md (define_attr cpu): Add bdver3. * config/i386/sse.md (sseshuf): New type attribute. * config/i386/athlon.md (sseshuf):Likewise. * config/i386/atom.md (sseshuf):Likewise. * config/i386/ppro.md (sseshuf):Likewise. * config/i386/bdver1.md (sseshuf):Likewise. * config/i386/i386.opt (flag_dispatch_scheduler): Add bdver3. * config/i386/i386-c.c (ix86_target_macros_internal): Add bdver3 def_and_undef * config/i386/driver-i386.c (host_detect_local_cpu): Let -march=native recognize bdver3 processors. * config/i386/i386.c (struct processor_costs bdver3_cost): New. (m_BDVER3): New definition. (m_AMD_MULTIPLE): Includes m_BDVER3. (initial_ix86_tune_features): Add bdver3 tune. (processor_target_table): Add bdver3 entry. (static const char *const cpu_names): Add bdver3 entry. (software_prefetching_beneficial_p): Add bdver3. (ix86_option_override_internal): Add bdver3 instruction sets. (ix86_option_override_internal): Remove XSAVEOPT for bdver1 and bdver2. (ix86_issue_rate): Add bdver3. (ix86_adjust_cost): Add bdver3. (enum target_cpu_default): Add TARGET_CPU_DEFAULT_bdver3. (enum processor_type): Add PROCESSOR_BDVER3. * config/i386/bdver3.md: New file describing bdver3 pipelines. OK for mainline. Thanks, Uros.
RE: [PATCH, i386]: AMD bdver3 enablement
You can see from the changes of sse.md that this is functionally a no-op change. Sseshuf replaces sselog. So, do you mean it should be added with sselog instead of sseadd? Adding it with sseadd (instead of sselog) influences the latency information. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Monday, November 12, 2012 2:30 AM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH, i386]: AMD bdver3 enablement On Fri, Nov 9, 2012 at 4:39 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: Changes done with respect to the review comments. Conditionally setting sseshuf type attribute has been removed. Instead new attribute is added and is included for other attribute calculations. The patch is attached as (difflog.txt). The new file (bdver3.md) describing the pipelines is also attached. Bootstrapping and make -k check passes. OK for upstream? 2012-11-09 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com bdver3 Enablement * gcc/doc/extend.texi: Add details about bdver3. * gcc/doc/invoke.texi: Add details about bdver3. * config.gcc (i[34567]86-*-linux* | ...): Add bdver3. (case ${target}): Add bdver3. * config/i386/i386.h (TARGET_BDVER3): New definition. * config/i386/i386.md (define_attr cpu): Add bdver3. * config/i386/sse.md (sseshuf): New type attribute. * config/i386/athlon.md (sseshuf):Likewise. * config/i386/atom.md (sseshuf):Likewise. * config/i386/ppro.md (sseshuf):Likewise. Index: gcc/config/i386/atom.md === --- gcc/config/i386/atom.md (revision 193132) +++ gcc/config/i386/atom.md (working copy) @@ -455,6 +455,30 @@ (eq_attr memory !none))) atom-simple-0) +(define_insn_reservation atom_sseshuf 1 + (and (eq_attr cpu atom) + (and (eq_attr type sseshuf) +(eq_attr memory none))) + atom-simple-either) + +(define_insn_reservation atom_sseshuf_mem 1 + (and (eq_attr cpu atom) + (and (eq_attr type sseshuf) +(eq_attr memory !none))) + atom-simple-either) + +(define_insn_reservation atom_sseshuf1 1 + (and (eq_attr cpu atom) + (and (eq_attr type sseshuf1) +(eq_attr memory none))) + atom-simple-0) + +(define_insn_reservation atom_sseshuf1_mem 1 + (and (eq_attr cpu atom) + (and (eq_attr type sseshuf1) +(eq_attr memory !none))) + atom-simple-0) + ;; not pmad, not psad (define_insn_reservation atom_sseiadd 1 (and (eq_attr cpu atom) This was not what I had in mind for changes in existing .md files. Just change them in this way: Index: atom.md === --- atom.md (revision 193407) +++ atom.md (working copy) @@ -594,7 +594,7 @@ ;; no memory simple (define_insn_reservation atom_sseadd 5 (and (eq_attr cpu atom) - (and (eq_attr type sseadd,sseadd1) + (and (eq_attr type sseadd,sseshuf,sseadd1,sseshuf1) (and (eq_attr memory none) (and (eq_attr mode !V2DF) (eq_attr atom_unit !complex) @@ -603,7 +603,7 @@ ;; memory simple (define_insn_reservation atom_sseadd_mem 5 (and (eq_attr cpu atom) - (and (eq_attr type sseadd,sseadd1) + (and (eq_attr type sseadd,sseshuf,sseadd1,sseshuf1) (and (eq_attr memory !none) (and (eq_attr mode !V2DF) (eq_attr atom_unit !complex) @@ -612,7 +612,7 @@ ;; maxps, minps, *pd, hadd, hsub (define_insn_reservation atom_sseadd_3 8 (and (eq_attr cpu atom) - (and (eq_attr type sseadd,sseadd1) + (and (eq_attr type sseadd,sseshuf,sseadd1,sseshuf1) (ior (eq_attr mode V2DF) (eq_attr atom_unit complex atom-complex, atom-all-eu*7) You can see from the changes of sse.md that this is functionally a no-op change. Uros.
RE: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon
Hi Jakub, We are working on the following. 1. bdver3 enablement. Review completed. Changes to be incorporated and checked-in. http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01131.html 2. btver2 basic enablement is done (http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01018.html)/ Scheduler descriptions are being updated. This is architecture specific and we consider it not to be a stage-1 material. Regards Ganesh -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Monday, October 29, 2012 11:27 PM To: g...@gcc.gnu.org Cc: gcc-patches@gcc.gnu.org Subject: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon Status == I'd like to close the stage 1 phase of GCC 4.8 development on Monday, November 5th. If you have still patches for new features you'd like to see in GCC 4.8, please post them for review soon. Patches posted before the freeze, but reviewed shortly after the freeze, may still go in, further changes should be just bugfixes and documentation fixes. Quality Data Priority # Change from Last Report --- --- P1 23 + 23 P2 77 + 8 P3 85 + 84 --- --- Total 185 +115 Previous Report === http://gcc.gnu.org/ml/gcc/2012-03/msg00011.html The next report will be sent by me again, announcing end of stage 1.
Add myself to MAINTAINERS
Adding myself to the list of members in write after approval. Index: ChangeLog === --- ChangeLog (revision 192977) +++ ChangeLog (working copy) @@ -1,3 +1,7 @@ +2012-10-30 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com + + * MAINTAINERS (Write After Approval): Add myself. + 2012-10-26 James Greenhalgh james.greenha...@arm.com * MAINTAINERS (Write After Approval): Add myself. Index: MAINTAINERS === --- MAINTAINERS (revision 192977) +++ MAINTAINERS (working copy) @@ -372,6 +372,7 @@ Chao-ying Fu f...@mips.com Gary Funck g...@intrepid.com Pompapathi V Gadad pompapathi.v.ga...@nsc.com +Gopalasubramanian Ganesh ganesh.gopalasubraman...@amd.com Kaveh Ghazigh...@gcc.gnu.org Matthew Gingellging...@gnat.com Tristan Gingoldging...@adacore.com Regards Ganesh
RE: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced
That was obvious. Sorry for the wrong commit. Thanks Jakub. -Ganesh -Original Message- From: Paolo Carlini [mailto:paolo.carl...@oracle.com] Sent: Wednesday, October 10, 2012 4:33 PM To: Jakub Jelinek Cc: Gopalasubramanian, Ganesh; Uros Bizjak; gcc-patches@gcc.gnu.org; veku...@gcc.gnu.org Subject: Re: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced On 10/10/2012 01:00 PM, Jakub Jelinek wrote: I have removed the extra line as obvious in SVN, to allow my bootstraps to continue. Thanks! Paolo.
RE: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced
Testing was done before posting the patch. It was successful. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Thursday, September 27, 2012 5:57 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced On Thu, Sep 27, 2012 at 10:30 AM, Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: This is a fix for PR 51109. There are three changes 1. Microcoded instructions are considered as single issue instructions and are therefore issued to a separate execution unit. 2. The multiplier unit is attached to execution unit 1 (ieu1). Since ieu is handled as a separate automaton in the patch, separate mult automaton is not required. 3. The integer execution units (2AGUs and 2EXs) are now decoupled. Now, they are described as separate automatons. Is it OK for upstream? Regards Ganesh 2012-09-27 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com PR 51109 * gcc/config/i386/bdver1.md (bdver1_int): Automaton has been split to reduce state transitions. OK for mainline, if tested according to [1]. [1] http://gcc.gnu.org/contribute.html#testing Thanks, Uros.
[PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced
Hi All, This is a fix for PR 51109. There are three changes 1. Microcoded instructions are considered as single issue instructions and are therefore issued to a separate execution unit. 2. The multiplier unit is attached to execution unit 1 (ieu1). Since ieu is handled as a separate automaton in the patch, separate mult automaton is not required. 3. The integer execution units (2AGUs and 2EXs) are now decoupled. Now, they are described as separate automatons. Is it OK for upstream? Regards Ganesh 2012-09-27 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com PR 51109 * gcc/config/i386/bdver1.md (bdver1_int): Automaton has been split to reduce state transitions. Index: gcc/config/i386/bdver1.md === --- gcc/config/i386/bdver1.md (revision 191658) +++ gcc/config/i386/bdver1.md (working copy) @@ -36,7 +36,7 @@ (define_attr bdver1_decode direct,vector,double (const_string direct)) -(define_automaton bdver1,bdver1_int,bdver1_load,bdver1_mult,bdver1_fp) +(define_automaton bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu) (define_cpu_unit bdver1-decode0 bdver1) (define_cpu_unit bdver1-decode1 bdver1) @@ -71,16 +71,14 @@ | (nothing,(bdver1-decode1 + bdver1-decode2 -(define_cpu_unit bdver1-ieu0 bdver1_int) -(define_cpu_unit bdver1-ieu1 bdver1_int) +(define_cpu_unit bdver1-ieu0 bdver1_ieu) +(define_cpu_unit bdver1-ieu1 bdver1_ieu) (define_reservation bdver1-ieu (bdver1-ieu0 | bdver1-ieu1)) -(define_cpu_unit bdver1-agu0 bdver1_int) -(define_cpu_unit bdver1-agu1 bdver1_int) +(define_cpu_unit bdver1-agu0 bdver1_agu) +(define_cpu_unit bdver1-agu1 bdver1_agu) (define_reservation bdver1-agu (bdver1-agu0 | bdver1-agu1)) -(define_cpu_unit bdver1-mult bdver1_mult) - (define_cpu_unit bdver1-load0 bdver1_load) (define_cpu_unit bdver1-load1 bdver1_load) (define_reservation bdver1-load bdver1-agu, @@ -93,6 +91,12 @@ ;; 128bit SSE instructions issue two stores at once. (define_reservation bdver1-store2 (bdver1-load0 + bdver1-load1)) +;; vectorpath (microcoded) instructions are single issue instructions. +;; So, they occupy all the integer units. +(define_reservation bdver1-ivector bdver1-ieu0+bdver1-ieu1+ + bdver1-agu0+bdver1-agu1+ + bdver1-load0+bdver1-load1) + ;; The FP operations start to execute at stage 12 in the pipeline, while ;; integer operations start to execute at stage 9 for athlon and 11 for K8 ;; Compensate the difference for athlon because it results in significantly @@ -125,7 +129,7 @@ (define_insn_reservation bdver1_call 0 (and (eq_attr cpu bdver1,bdver2) (eq_attr type call,callv)) -bdver1-double,bdver1-agu,bdver1-ieu) +bdver1-double,bdver1-agu) ;; PUSH mem is double path. (define_insn_reservation bdver1_push 1 (and (eq_attr cpu bdver1,bdver2) @@ -135,17 +139,17 @@ (define_insn_reservation bdver1_pop 1 (and (eq_attr cpu bdver1,bdver2) (eq_attr type pop)) -bdver1-direct,(bdver1-ieu+bdver1-load)) +bdver1-direct,bdver1-ivector) ;; LEAVE no latency info so far, assume same with amdfam10. (define_insn_reservation bdver1_leave 3 (and (eq_attr cpu bdver1,bdver2) (eq_attr type leave)) -bdver1-vector,(bdver1-ieu+bdver1-load)) +bdver1-vector,bdver1-ivector) ;; LEA executes in AGU unit with 1 cycle latency on BDVER1. (define_insn_reservation bdver1_lea 1 (and (eq_attr cpu bdver1,bdver2) (eq_attr type lea)) -bdver1-direct,bdver1-agu,nothing) +bdver1-direct,bdver1-agu) ;; MUL executes in special multiplier unit attached to IEU1. (define_insn_reservation bdver1_imul_DI 6 @@ -153,23 +157,23 @@ (and (eq_attr type imul) (and (eq_attr mode DI) (eq_attr memory none,unknown - bdver1-direct1,bdver1-ieu1,bdver1-mult,nothing,bdver1-ieu1) +bdver1-direct1,bdver1-ieu1) (define_insn_reservation bdver1_imul 4 (and (eq_attr cpu bdver1,bdver2) (and (eq_attr type imul) (eq_attr memory none,unknown))) -bdver1-direct1,bdver1-ieu1,bdver1-mult,bdver1-ieu1) +bdver1-direct1,bdver1-ieu1) (define_insn_reservation bdver1_imul_mem_DI 10 (and (eq_attr cpu bdver1,bdver2) (and (eq_attr type imul)
RE: [PATCH,i386] fma4 addition for bdver2
Hi, The second change (done in config/i386/driver-i386.c (host_detect_local_cpu)) is not reflected in svn revision 191109. Since we are handling the fma instruction selection in i386.c\i386.md, we need not disable the flag in driver. Let me know your opinion. Regards Ganesh -Original Message- From: Gopalasubramanian, Ganesh Sent: Wednesday, September 05, 2012 3:41 PM To: gcc-patches@gcc.gnu.org Cc: Uros Bizjak (ubiz...@gmail.com) Subject: [PATCH,i386] fma4 addition for bdver2 Hello, FMA4 and FMA3 ISA are implemented in bdver2 target. FMA3 is selected by default. This patch supports the use of FMA4 intrinsics for bdver2 targets. Is it OK for trunk? Regards Ganesh 2012-09-05 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/i386.md : Comments on fma4 instruction selection reflect requirement on register pressure based cost model. * config/i386/driver-i386.c (host_detect_local_cpu): fma4 flag is set-reset as informed by the cpuid flag. * config/i386/i386.c (processor_alias_table): fma4 flag is enabled for bdver2. Index: gcc/config/i386/i386.md === --- gcc/config/i386/i386.md (revision 190830) +++ gcc/config/i386/i386.md (working copy) @@ -659,9 +659,11 @@ (eq_attr isa noavx2) (symbol_ref !TARGET_AVX2) (eq_attr isa bmi2) (symbol_ref TARGET_BMI2) (eq_attr isa fma) (symbol_ref TARGET_FMA) -;; Disable generation of FMA4 instructions for generic code -;; since FMA3 is preferred for targets that implement both -;; instruction sets. +;; Fma instruction selection has to be done based on +;; register pressure. For generating fma4, a cost model +;; based on register pressure is required. Till then, +;; fma4 instruction is disabled for targets that implement +;; both fma and fma4 instruction sets. (eq_attr isa fma4) (symbol_ref TARGET_FMA4 !TARGET_FMA) ] Index: gcc/config/i386/driver-i386.c === --- gcc/config/i386/driver-i386.c (revision 190830) +++ gcc/config/i386/driver-i386.c (working copy) @@ -483,8 +483,6 @@ has_abm = ecx bit_ABM; has_lwp = ecx bit_LWP; has_fma4 = ecx bit_FMA4; - if (vendor == SIG_AMD has_fma4 has_fma) - has_fma4 = 0; has_xop = ecx bit_XOP; has_tbm = ecx bit_TBM; has_lzcnt = ecx bit_LZCNT; Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 190830) +++ gcc/config/i386/i386.c (working copy) @@ -3164,7 +3164,7 @@ {bdver2, PROCESSOR_BDVER2, CPU_BDVER2, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 - | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX + | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C | PTA_FMA}, {btver1, PROCESSOR_BTVER1, CPU_GENERIC64, Regards Ganesh
[PATCH,i386] fma4 addition for bdver2
Hello, FMA4 and FMA3 ISA are implemented in bdver2 target. FMA3 is selected by default. This patch supports the use of FMA4 intrinsics for bdver2 targets. Is it OK for trunk? Regards Ganesh 2012-09-05 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/i386.md : Comments on fma4 instruction selection reflect requirement on register pressure based cost model. * config/i386/driver-i386.c (host_detect_local_cpu): fma4 flag is set-reset as informed by the cpuid flag. * config/i386/i386.c (processor_alias_table): fma4 flag is enabled for bdver2. Index: gcc/config/i386/i386.md === --- gcc/config/i386/i386.md (revision 190830) +++ gcc/config/i386/i386.md (working copy) @@ -659,9 +659,11 @@ (eq_attr isa noavx2) (symbol_ref !TARGET_AVX2) (eq_attr isa bmi2) (symbol_ref TARGET_BMI2) (eq_attr isa fma) (symbol_ref TARGET_FMA) -;; Disable generation of FMA4 instructions for generic code -;; since FMA3 is preferred for targets that implement both -;; instruction sets. +;; Fma instruction selection has to be done based on +;; register pressure. For generating fma4, a cost model +;; based on register pressure is required. Till then, +;; fma4 instruction is disabled for targets that implement +;; both fma and fma4 instruction sets. (eq_attr isa fma4) (symbol_ref TARGET_FMA4 !TARGET_FMA) ] Index: gcc/config/i386/driver-i386.c === --- gcc/config/i386/driver-i386.c (revision 190830) +++ gcc/config/i386/driver-i386.c (working copy) @@ -483,8 +483,6 @@ has_abm = ecx bit_ABM; has_lwp = ecx bit_LWP; has_fma4 = ecx bit_FMA4; - if (vendor == SIG_AMD has_fma4 has_fma) - has_fma4 = 0; has_xop = ecx bit_XOP; has_tbm = ecx bit_TBM; has_lzcnt = ecx bit_LZCNT; Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 190830) +++ gcc/config/i386/i386.c (working copy) @@ -3164,7 +3164,7 @@ {bdver2, PROCESSOR_BDVER2, CPU_BDVER2, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 - | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX + | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C | PTA_FMA}, {btver1, PROCESSOR_BTVER1, CPU_GENERIC64, Regards Ganesh
RE: [PATCH,i386] fma,fma4 and xop flags
This won't work, since we have to prefer FMA3 also in case when only -mfma -mfma4 without -mtune=XX is used. We can add TARGET_FMA_BOTH though, but I doubt there will ever be target that implements both insn sets without preferences. Preferring FMA3 over FMA4 might not do good always. For instance, with increased register pressure FMA3 can be used. But, when we have more registers at our disposal, fma4 if used might do good by avoiding extra reload. IMO, when preference of FMA instructions is adjudged by register pressure, we may need some functionality to support that. So, ideally for bdver2, we like to have both fma and fma4 getting generated with options -mfma -mfma4. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Tuesday, August 14, 2012 9:12 PM To: Richard Henderson Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] fma,fma4 and xop flags On Mon, Aug 13, 2012 at 9:50 PM, Richard Henderson r...@redhat.com wrote: On 08/13/2012 12:33 PM, Uros Bizjak wrote: AFAIU fma3 is better than fma4 for bdver2 (the only CPU that implements both FMA sets). Current description of bdver2 doesn't even enable fma4 in processor_alias_table due to this fact. The change you are referring to adds preference for fma3 insn set for generic code (not FMA4 builtins!), even when fma4 is enabled. So, no matter which combination and sequence of -mfmfa -mfma4 or -mxop user passes to the compiler, only fma3 instructions will be generated. This rationale needs to appear as a comment above + (eq_attr isa fma4) +(symbol_ref TARGET_FMA4 !TARGET_FMA) I plan to commit following patch: --cut here-- Index: i386.md === --- i386.md (revision 190362) +++ i386.md (working copy) @@ -659,6 +659,9 @@ (eq_attr isa noavx2) (symbol_ref !TARGET_AVX2) (eq_attr isa bmi2) (symbol_ref TARGET_BMI2) (eq_attr isa fma) (symbol_ref TARGET_FMA) +;; Disable generation of FMA4 instructions for generic code +;; since FMA3 is preferred for targets that implement both +;; instruction sets. (eq_attr isa fma4) (symbol_ref TARGET_FMA4 !TARGET_FMA) ] --cut here-- Longer term we may well require some sort of (TARGET_FMA4 !(TARGET_FMA TARGET_PREFER_FMA3)) with an appropriate entry in ix86_tune_features to match. This won't work, since we have to prefer FMA3 also in case when only -mfma -mfma4 without -mtune=XX is used. We can add TARGET_FMA_BOTH though, but I doubt there will ever be target that implements both insn sets without preferences. Uros.
[PATCH,i386] cpuid function for prefetchw
Hello, To get the prefetchw cpuid flag, cpuid function 0x8001 needs to be called. Previous to patch, function 0x7 is called. Bootstrapping and make -k check passes without failures. Ok for trunk? Regards Ganesh 2012-08-13 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com PR driver/54210 * config/i386/driver-i386.c (host_detect_local_cpu): Call cpuid function 0x8001 to get the prfchw cpuid flag. Index: gcc/config/i386/driver-i386.c === --- gcc/config/i386/driver-i386.c (revision 189996) +++ gcc/config/i386/driver-i386.c (working copy) @@ -467,7 +467,6 @@ has_bmi2 = ebx bit_BMI2; has_fsgsbase = ebx bit_FSGSBASE; has_rdseed = ebx bit_RDSEED; - has_prfchw = ecx bit_PRFCHW; } /* Check cpuid level of extended features. */ @@ -491,6 +490,7 @@ has_longmode = edx bit_LM; has_3dnowp = edx bit_3DNOWP; has_3dnow = edx bit_3DNOW; + has_prfchw = ecx bit_PRFCHW; } if (!arch
RE: [PATCH,i386] cpuid function for prefetchw
Yes! Thanks Jakub. -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Monday, August 13, 2012 3:16 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] cpuid function for prefetchw On Mon, Aug 13, 2012 at 09:29:45AM +, Gopalasubramanian, Ganesh wrote: To get the prefetchw cpuid flag, cpuid function 0x8001 needs to be called. Previous to patch, function 0x7 is called. Bootstrapping and make -k check passes without failures. Ok for trunk? IMHO you move it to a wrong spot, ecx bits of CPUID 0x8001 are tested earlier. So I think you want this instead (bootstrap/regtest in progress): 2012-08-13 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com Jakub Jelinek ja...@redhat.com PR driver/54210 * config/i386/driver-i386.c (host_detect_local_cpu): Test bit_PRFCHW bit of CPUID 0x8001 %ecx instead of CPUID 7 %ecx. * config/i386/cpuid.h (bits_PRFCHW): Move definition to CPUID 0x8001 %ecx flags. --- gcc/config/i386/driver-i386.c.jj2012-08-10 15:49:25.0 +0200 +++ gcc/config/i386/driver-i386.c 2012-08-13 11:30:14.570494736 +0200 @@ -467,7 +467,6 @@ const char *host_detect_local_cpu (int a has_bmi2 = ebx bit_BMI2; has_fsgsbase = ebx bit_FSGSBASE; has_rdseed = ebx bit_RDSEED; - has_prfchw = ecx bit_PRFCHW; has_adx = ebx bit_ADX; } @@ -488,6 +487,7 @@ const char *host_detect_local_cpu (int a has_xop = ecx bit_XOP; has_tbm = ecx bit_TBM; has_lzcnt = ecx bit_LZCNT; + has_prfchw = ecx bit_PRFCHW; has_longmode = edx bit_LM; has_3dnowp = edx bit_3DNOWP; --- gcc/config/i386/cpuid.h.jj 2012-08-10 15:49:25.0 +0200 +++ gcc/config/i386/cpuid.h 2012-08-13 11:31:30.346494092 +0200 @@ -52,6 +52,7 @@ #define bit_LAHF_LM(1 0) #define bit_ABM(1 5) #define bit_SSE4a (1 6) +#define bit_PRFCHW (1 8) #define bit_XOP (1 11) #define bit_LWP(1 15) #define bit_FMA4(1 16) @@ -69,7 +70,6 @@ #define bit_HLE(1 4) #define bit_AVX2 (1 5) #define bit_BMI2 (1 8) -#define bit_PRFCHW (1 8) #define bit_RTM(1 11) #define bit_RDSEED (1 18) #define bit_ADX(1 19) Jakub
RE: [PATCH,i386] fma,fma4 and xop flags
Thank you Uros, Richard! I will confirm the test results in couple off days. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Saturday, August 11, 2012 3:54 AM To: Richard Henderson Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] fma,fma4 and xop flags On Fri, Aug 10, 2012 at 10:02 PM, Richard Henderson r...@redhat.com wrote: On 2012-08-10 12:59, Uros Bizjak wrote: Actually, this is the problem you are trying to solve. The fma4 patterns are defined before fma3, so gcc prefers these. The Real Problem is that they should not be separate patterns. They should be a single pattern that selects alternatives via the enabled isa. 2012-08-11 Uros Bizjak ubiz...@gmail.com * config/i386/i386.md (isa): Add fma and fma4. (enabled): Handle fma and fma4. * config/i386/sse.md (*fma_fmadd_mode): Merge *fma4_fmadd_mode. (*fma_fmsub_mode): Merge *fma4_fmsub_mode. (*fma_fnmadd_mode): Merge *fma4_fnmadd_mode. (*fma_fnmsub_mode): Merge *fma4_fnmsub_mode. (*fma_fmaddsub_mode): Merge *fma4_fmaddsub_mode. (*fma_fmsubadd_mode): Merge *fma4_fmsubadd_mode. Tested on x86_64-pc-linux-gnu {,-m32}, committed to mainline SVN. I will wait a couple of days before backporting patches to 4.7, so please Ganesh, test mainline if everything is OK. BTW: With this patch, we can enable PTA_FMA4 for bdver2 target. Uros.
RE: [PATCH,i386] fma,fma4 and xop flags
-mxop implies -mfma4, but reverse is not true. I think this handling went in for bdver1. But, with bdver2, we have both fma and fma4. So for bdver2, -mxop should not be enabling one of them. if someone set -mfma4 together with -mfma on the command line, we should NOT disable selected ISA behind user's back If both -mfma4 and -mfma are enabled, GCC outputs fma4 instructions. This, I think is because fma4 instruction patterns are read before fma instruction patterns from the .md files. So, enabling both -mfma4 and -mfma is not good for bdver2. Moreover, if user tries to use, -mfma -mno-fma4 -mxop, the order in which these options are used becomes crucial. -mxop enables -mfma4 and by instruction patterns fma4 instructions gets listed in the assembly file. For the below test, double a,b,c,d; int fn(){ a = b + c * d ; return a; } #1) Using options -O2 -mno-fma4 -mfma -mxop outputs fma4. (vfmaddsdb(%rip), %xmm2, %xmm1, %xmm0) #2) Using options -O2 -mfma -mno-fma4 -mxop outputs fma4. (vfmaddsdb(%rip), %xmm2, %xmm1, %xmm0) #3) Using options -mxop -mno-fma4 -mfma outpts fma. (vfmadd132sd d(%rip), %xmm1, %xmm0) As we see the order in which the options are used becomes crucial. This is confusing. I haven't really tested other implied options. But, I suspect similar phenomenon in those cases too. IMO, we can directly go by the CPUID flags and enable the flags. This will be a one to one mapping and leave the user with lot more liberty. Please let me know your opinion. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Friday, August 10, 2012 1:21 AM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] fma,fma4 and xop flags On Wed, Aug 8, 2012 at 1:31 PM, ganesh.gopalasubraman...@amd.com wrote: Bdver2 cpu supports both fma and fma4 instructions. Previous to patch, option -mno-xop removes -mfma4. Similarly, option -mno-fma4 removes -mxop. It looks to me that there is some misunderstanding. AFAICS: -mxop implies -mfma4, but reverse is not true. Please see #define OPTION_MASK_ISA_FMA4_SET \ (OPTION_MASK_ISA_FMA4 | OPTION_MASK_ISA_SSE4A_SET \ | OPTION_MASK_ISA_AVX_SET) #define OPTION_MASK_ISA_XOP_SET \ (OPTION_MASK_ISA_XOP | OPTION_MASK_ISA_FMA4_SET) So, -mxop sets -mfma4, etc ..., but -mfma4 does NOT enable -mxop. OTOH, #define OPTION_MASK_ISA_FMA4_UNSET \ (OPTION_MASK_ISA_FMA4 | OPTION_MASK_ISA_XOP_UNSET) #define OPTION_MASK_ISA_XOP_UNSET OPTION_MASK_ISA_XOP -mno-fma4 implies -mno-xop, but again reverse is not true. Thus, -mno-xop does NOT imply -mno-fma4. So, the patch conditionally disables -mfma or -mfma4. Enabling -mxop is done by also checking -mfma. Please note that conditional handling of ISA flags belongs to ix86_option_override_internal. However, if someone set -mfma4 together with -mfma on the command line, we should NOT disable selected ISA behind user's back, in the same way as we don't disable anything with -march=i386 -msse4. With -march=bdver2, we already marked that only fma is supported, and if user selected -march=bdver2 -mfma4 on the command line, we shouldn't disable anything. Uros.
RE: [PATCH,i386] fma,fma4 and xop flags
Otherwise, what does -mno-fma4 -mxop do? (it should enable both xop and fma4!) what should -mfma4 -mno-xop do (it should disable both xop and fma4!). Yes! that's what GCC does now. Some flags are coupled (atleast for now). For ex, -mno-sse4.2 -mavx enables both sse4.2 and avx whereas -mavx -mno-sse4.2 disables both. Setting of the following are clubbed. 1) 3DNow sets MMX 2) SSE2 sets SSE 3) SSE3 sets SSE2 4) SSE4_1 sets SSE3 5) SSE4_2 sets SSE4_1 6) FMA sets AVX 7) AVX2 sets AVX 8) SSE4_A sets SSE3 9) FMA4 set SSE4_A and AVX 10) XOP sets FMA4 11) AES sets SSE2 12) PCLMUL sets SSE2 13) ABM sets POPCNT Resetting is done in reversely (MMX resets 3DNOW). IMO, if we have different cpuid flags, enabling\disabling the compiler flags depends on these cpuid flags directly. Adding subsets to them or tangling them together may give wrong results. Please let me know your opinion. Regards Ganesh -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Wednesday, August 08, 2012 5:12 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; ubiz...@gmail.com Subject: Re: [PATCH,i386] fma,fma4 and xop flags On Wed, Aug 8, 2012 at 1:31 PM, ganesh.gopalasubraman...@amd.com wrote: Hello, Bdver2 cpu supports both fma and fma4 instructions. Previous to patch, option -mno-xop removes -mfma4. Similarly, option -mno-fma4 removes -mxop. Eh? Why's that? I think we should disentangle -mxop and -mfma4 instead. Otherwise, what does -mno-fma4 -mxop do? (it should enable both xop and fma4!) what should -mfma4 -mno-xop do (it should disable both xop and fma4!). All this is just confusing to the user, even if in AMD documents XOP includes FMA4. Richard.
Backport: fma3 instruction generation for 'march=native' in AMD processors
Hello, Below is the patch that has been committed in trunk (Revision: 187075). We like to backport it to GCC 4.7 branch as couple of AMD processors require this change for fma3 instruction generation. Bootstrapping and testing are successful. Is it OK to commit in GCC 4.7 branch? Regards Ganesh PATCH = * config/i386/driver-i386.c (host_detect_local_cpu): Reset has_fma4 for AMD processors with both fma3 and fma4 support. Index: config/i386/driver-i386.c === --- config/i386/driver-i386.c (revision 186897) +++ config/i386/driver-i386.c (working copy) @@ -472,6 +472,8 @@ has_abm = ecx bit_ABM; has_lwp = ecx bit_LWP; has_fma4 = ecx bit_FMA4; + if (vendor == SIG_AMD has_fma4 has_fma) + has_fma4 = 0; has_xop = ecx bit_XOP; has_tbm = ecx bit_TBM; has_lzcnt = ecx bit_LZCNT;
Re: [PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors
I have added the ChangeLog and modified the patch. Is it OK to commit to trunk? Regards Ganesh 2012-05-03 Ganesh Gopalasubramanian ganesh.gopalasubraman...@amd.com * config/i386/driver-i386.c (host_detect_local_cpu): Reset has_fma4 for AMD processors with both fma3 and fma4 support. Index: config/i386/driver-i386.c === --- config/i386/driver-i386.c (revision 186897) +++ config/i386/driver-i386.c (working copy) @@ -472,6 +472,8 @@ has_abm = ecx bit_ABM; has_lwp = ecx bit_LWP; has_fma4 = ecx bit_FMA4; + if (vendor == SIG_AMD has_fma4 has_fma) + has_fma4 = 0; has_xop = ecx bit_XOP; has_tbm = ecx bit_TBM; has_lzcnt = ecx bit_LZCNT; -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Wednesday, May 02, 2012 5:11 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors On Wed, May 02, 2012 at 11:12:33AM +, Gopalasubramanian, Ganesh wrote: For AMD architectures with both fma3 and fma4 instructions' support, GCC generates fma4 by default. Instead, we like to generate fma3 instruction. Below patch enables the fma3 instruction generation for -march=native. Ok for trunk? You haven't provided ChangeLog entry. Index: gcc/config/i386/driver-i386.c === --- gcc/config/i386/driver-i386.c (revision 186897) +++ gcc/config/i386/driver-i386.c (working copy) @@ -472,6 +472,10 @@ has_abm = ecx bit_ABM; has_lwp = ecx bit_LWP; has_fma4 = ecx bit_FMA4; + if (((vendor == SIG_AMD)) (has_fma4) (has_fma)) +{ +has_fma4 = 0; +} And the formatting of this is wrong, 4 unnecessary pairs of (), one unnecessary {} pair, bad indentation of the has_fma4 = 0; assignment (should use a tab). Jakub
[PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors
For AMD architectures with both fma3 and fma4 instructions' support, GCC generates fma4 by default. Instead, we like to generate fma3 instruction. Below patch enables the fma3 instruction generation for -march=native. Ok for trunk? Index: gcc/config/i386/driver-i386.c === --- gcc/config/i386/driver-i386.c (revision 186897) +++ gcc/config/i386/driver-i386.c (working copy) @@ -472,6 +472,10 @@ has_abm = ecx bit_ABM; has_lwp = ecx bit_LWP; has_fma4 = ecx bit_FMA4; + if (((vendor == SIG_AMD)) (has_fma4) (has_fma)) +{ +has_fma4 = 0; +} has_xop = ecx bit_XOP; has_tbm = ecx bit_TBM; has_lzcnt = ecx bit_LZCNT; Regards Ganesh