Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor

2018-02-19 Thread Wilco Dijkstra
Siddhesh Poyarekar wrote:
> On Thursday 15 February 2018 07:50 PM, Wilco Dijkstra wrote:
>> So it seems to me using existing cost mechanisms is always preferable, even
>> if you currently can't differentiate between loads and stores.
>
> Luis is working on address cost adjustments among other things, so I
> guess the path of least resistance for gcc8 is to have those adjustments
> go in and then figure out how much improvement this patch (or separating
> loads and stores) would get on top of that.  Would that be acceptable?

Yes, adjusting costs is not an issue as that's clearly something we need to
improve anyway. Having numbers for both approaches would be useful; however,
I think it's still best to go with the cost approach for GCC8 as that should
get most of the gain.

Wilco

Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor

2018-02-15 Thread Siddhesh Poyarekar
On Thursday 15 February 2018 07:50 PM, Wilco Dijkstra wrote:
> So it seems to me using existing cost mechanisms is always preferable, even
> if you currently can't differentiate between loads and stores.

Luis is working on address cost adjustments among other things, so I
guess the path of least resistance for gcc8 is to have those adjustments
go in and then figure out how much improvement this patch (or separating
loads and stores) would get on top of that.  Would that be acceptable?

Siddhesh


Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor

2018-02-15 Thread Wilco Dijkstra
Hi Siddhesh,

I still don't like the idea of disabling a whole class of instructions in the 
md file.
It seems much better to adjust the costs here so that you get most of the
improvement now, and fine tune it once we can differentiate between
loads and stores.

Taking your example, adding -funroll-loops generates this for Falkor:

ldr q7, [x2, x18]
add x5, x18, 16
add x4, x1, x18
add x10, x18, 32
add x11, x1, x5
add x3, x18, 48
add x12, x1, x10
add x9, x18, 64
add x14, x1, x3
add x8, x18, 80
add x15, x1, x9
add x7, x18, 96
add x16, x1, x8
str q7, [x4]
ldr q16, [x2, x5]
add x6, x18, 112
add x17, x1, x7
add x18, x18, 128
add x5, x1, x6
cmp x18, x13
str q16, [x11]
ldr q17, [x2, x10]
str q17, [x12]
ldr q18, [x2, x3]
str q18, [x14]
ldr q19, [x2, x9]
str q19, [x15]
ldr q20, [x2, x8]
str q20, [x16]
ldr q21, [x2, x7]
str q21, [x17]
ldr q22, [x2, x6]
str q22, [x5]
bne .L25

If you adjust costs, however, you'd get this:

.L25:
ldr q7, [x14]
add x14, x14, 128
add x4, x4, 128
str q7, [x4, -128]
ldr q16, [x14, -112]
str q16, [x4, -112]
ldr q17, [x14, -96]
str q17, [x4, -96]
ldr q18, [x14, -80]
str q18, [x4, -80]
ldr q19, [x14, -64]
str q19, [x4, -64]
ldr q20, [x14, -48]
str q20, [x4, -48]
ldr q21, [x14, -32]
str q21, [x4, -32]
ldr q22, [x14, -16]
cmp x14, x9
str q22, [x4, -16]
bne .L25

So it seems to me using existing cost mechanisms is always preferable, even
if you currently can't differentiate between loads and stores.

Wilco

[PING][PATCH v3] Disable reg offset in quad-word store for Falkor

2018-02-15 Thread Siddhesh Poyarekar
Ping!

On Friday 09 February 2018 01:02 PM, Siddhesh Poyarekar wrote:
> Hi,
> 
> Here's v3 of the patch to disable register offset addressing mode for
> stores of 128-bit values on Falkor because they're very costly.
> Following Kyrill's suggestion, I compared the codegen for a53 and
> found that the codegen was quite different.  Jim's original patch is
> the most minimal compromise for this and is also a cleaner temporary
> fix before I attempt to split address costs into loads and stores for
> gcc9.
> 
> So v3 is essentially a very slightly cleaned up version of v1 again,
> this time with confirmation that there are no codegen changes in
> CPU2017 on non-Falkor builds; only the codegen for -mcpu=falkor is
> different.
> 
> 
> 
> On Falkor, because of an idiosyncrasy of how the pipelines are
> designed, a quad-word store using a reg+reg addressing mode is almost
> twice as slow as an add followed by a quad-word store with a single
> reg addressing mode.  So we get better performance if we disallow
> addressing modes using register offsets with quad-word stores.  This
> is the most minimal change for gcc8; I will volunteer to make a more
> lasting change for gcc9, where I split the addressing mode costs into
> loads and stores wherever possible and needed.
> 
> This patch improves fpspeed by 0.17% and intspeed by 0.62% in CPU2017,
> with xalancbmk_s (3.84%), wrf_s (1.46%) and mcf_s (1.62%) being the
> biggest winners.  There were no regressions beyond 0.4%.
> 
> 2018-xx-xx  Jim Wilson  
> Kugan Vivenakandarajah  
>   Siddhesh Poyarekar  
> 
>   gcc/
>   * config/aarch64/aarch64-protos.h (aarch64_movti_target_operand_p):
>   New.
>   * config/aarch64/aarch64-simd.md (aarch64_simd_mov): Use Utf.
>   * config/aarch64/aarch64-tuning-flags.def
>   (SLOW_REGOFFSET_QUADWORD_STORE): New.
>   * config/aarch64/aarch64.c (qdf24xx_tunings): Add
>   SLOW_REGOFFSET_QUADWORD_STORE to tuning flags.
>   (aarch64_movti_target_operand_p): New.
>   * config/aarch64/aarch64.md (movti_aarch64): Use Utf.
>   (movtf_aarch64): Likewise.
>   * config/aarch64/constraints.md (Utf): New.
> 
>   gcc/testsuite
>   * gcc.target/aarch64/pr82533.c: New test case.
> ---
>  gcc/config/aarch64/aarch64-protos.h |  1 +
>  gcc/config/aarch64/aarch64-simd.md  |  4 ++--
>  gcc/config/aarch64/aarch64-tuning-flags.def |  4 
>  gcc/config/aarch64/aarch64.c| 14 +-
>  gcc/config/aarch64/aarch64.md   |  8 
>  gcc/config/aarch64/constraints.md   |  6 ++
>  gcc/testsuite/gcc.target/aarch64/pr82533.c  | 11 +++
>  7 files changed, 41 insertions(+), 7 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr82533.c
> 
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index cda2895d28e..5a0323deb1e 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -433,6 +433,7 @@ bool aarch64_simd_mem_operand_p (rtx);
>  bool aarch64_sve_ld1r_operand_p (rtx);
>  bool aarch64_sve_ldr_operand_p (rtx);
>  bool aarch64_sve_struct_memory_operand_p (rtx);
> +bool aarch64_movti_target_operand_p (rtx);
>  rtx aarch64_simd_vect_par_cnst_half (machine_mode, int, bool);
>  rtx aarch64_tls_get_addr (void);
>  tree aarch64_fold_builtin (tree, int, tree *, bool);
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 3d1f6a01cb7..f7daac3e28d 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -131,9 +131,9 @@
>  
>  (define_insn "*aarch64_simd_mov"
>[(set (match_operand:VQ 0 "nonimmediate_operand"
> - "=w, Umq,  m,  w, ?r, ?w, ?r, w")
> + "=w, Umq, Utf,  w, ?r, ?w, ?r, w")
>   (match_operand:VQ 1 "general_operand"
> - "m,  Dz, w,  w,  w,  r,  r, Dn"))]
> + "m,  Dz,w,  w,  w,  r,  r, Dn"))]
>"TARGET_SIMD
> && (register_operand (operands[0], mode)
> || aarch64_simd_reg_or_zero (operands[1], mode))"
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index ea9ead234cb..04baf5b6de6 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -41,4 +41,8 @@ AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", 
> SLOW_UNALIGNED_LDPW)
> are not considered cheap.  */
>  AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
>  
> +/* Don't use a register offset in a memory address for a quad-word store.  */
> +AARCH64_EXTRA_TUNING_OPTION ("slow_regoffset_quadword_store",
> +  SLOW_REGOFFSET_QUADWORD_STORE)
> +
>  #undef AARCH64_EXTRA_TUNING_OPTION
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index