On Wed, Oct 04, 2017 at 05:44:07PM +0100, James Greenhalgh wrote:
>
> On Thu, Jul 27, 2017 at 06:49:01PM +0100, James Greenhalgh wrote:
> >
> > On Mon, Jun 12, 2017 at 02:44:40PM +0100, James Greenhalgh wrote:
> > > [Sorry for the re-send.  I spotted that the attributes were not right
> > > for the new pattern I was adding.  The change between this and the
> > > first version was:
> > >
> > > +  [(set_attr "type" "neon_bsl,neon_bsl,neon_bsl,multiple")
> > > +   (set_attr "length" "4,4,4,12")]
> > > ]
> > >
> > > ---
> > >
> > > Hi,
> > >
> > > In this testcase, all argument registers and the return register
> > > will be general purpose registers:
> > >
> > >   long long
> > >   foo (long long a, long long b, long long c)
> > >   {
> > >     return ((a ^ b) & c) ^ b;
> > >   }
> > >
> > > However, due to the implementation of aarch64_simd_bsl<mode>_internal
> > > we'll match that pattern and emit a BSL, necessitating moving all those
> > > arguments and results to the Advanced SIMD registers:
> > >
> > >   fmov    d2, x0
> > >   fmov    d0, x2
> > >   fmov    d1, x1
> > >   bsl     v0.8b, v2.8b, v1.8b
> > >   fmov    x0, d0
> > >
> > > To fix this, we turn aarch64_simd_bsldi_internal into an insn_and_split
> > > that knows to split back to integer operations if the register
> > > allocation falls that way.
> > >
> > > We could have used an unspec, but then we lose some of the nice
> > > simplifications that can be made from explicitly spelling out the
> > > semantics of BSL.
> >
> > Off list, Richard and I found considerable issues with this patch, from
> > idioms failing to match in the testcase, to subtle register allocation
> > bugs, to potential for suboptimal code generation.
> >
> > That's not good!
> >
> > This spin of the patch corrects those issues by adding a second split
> > pattern, allocating a temporary register if we're permitted to, or
> > properly using tied output operands if we're not, and by generally
> > playing things a bit safer around the possibility of register overlaps.
> >
> > The testcase is expanded to an execute test, hopefully giving a little
> > more assurance that we're doing the right thing - now testing the BSL
> > idiom generation in both general-purpose and floating-point registers
> > and comparing the results.  Hopefully we're now a bit more robust!
> >
> > Bootstrapped and tested on aarch64-none-linux-gnu and cross-tested on
> > aarch64-none-elf with no issues.
> >
> > OK for trunk?
>
> I'd like to ping this patch for review (I would like an explicit review,
> as an earlier revision had bugs); attached is a rebase of this patch on
> trunk.
>
> OK?  If so, please commit it on my behalf, as I will be out of office for
> a few weeks for some time off.
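[A note on the idiom before the patch: ((a ^ b) & c) ^ b is a branch-free
bitwise select.  Each result bit comes from a where the corresponding bit
of c is set, and from b otherwise; this is exactly what the AArch64 BSL
instruction computes, with the mask c held in the destination register.
A minimal standalone C sketch of that equivalence (the helper name
bit_select is illustrative, not something defined by the patch):

  #include <assert.h>

  /* Select bits of A where MASK is set, bits of B where MASK is clear.
     Where a mask bit is 1: ((a ^ b) & mask) ^ b == (a ^ b) ^ b == a.
     Where a mask bit is 0: 0 ^ b == b.  */
  static unsigned long long
  bit_select (unsigned long long a, unsigned long long b,
              unsigned long long mask)
  {
    return ((a ^ b) & mask) ^ b;
  }

  int
  main (void)
  {
    /* High nibbles come from a (all ones), low nibbles from b (zero).  */
    assert (bit_select (~0ULL, 0ULL, 0xf0f0f0f0f0f0f0f0ULL)
            == 0xf0f0f0f0f0f0f0f0ULL);
    return 0;
  }

XORing back a instead of b (that is, ((a ^ b) & c) ^ a) selects the
opposite way around, which is why the patch below carries both an
_internal and an _alt pattern for the permutations combine can produce.]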
IMHO, this has had plenty of time for feedback from other AArch64
maintainers.  I've used my maintainer privileges and pushed this as r254727
after a successful bootstrap and test cycle.

Thanks,
James

> gcc/
>
> 2017-10-04  James Greenhalgh  <james.greenha...@arm.com>
>
> 	* config/aarch64/aarch64-simd.md
> 	(aarch64_simd_bsl<mode>_internal): Remove DImode.
> 	(*aarch64_simd_bsl<mode>_alt): Likewise.
> 	(aarch64_simd_bsldi_internal): New.
> 	(aarch64_simd_bsldi_alt): Likewise.
>
> gcc/testsuite/
>
> 2017-10-04  James Greenhalgh  <james.greenha...@arm.com>
>
> 	* gcc.target/aarch64/bsl-idiom.c: New.
> 	* gcc.target/aarch64/copysign-bsl.c: New.
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 70e9339..1888d2f 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -2322,13 +2322,13 @@
>  ;; in *aarch64_simd_bsl<mode>_alt.
>
>  (define_insn "aarch64_simd_bsl<mode>_internal"
> -  [(set (match_operand:VSDQ_I_DI 0 "register_operand" "=w,w,w")
> -        (xor:VSDQ_I_DI
> -           (and:VSDQ_I_DI
> -             (xor:VSDQ_I_DI
> +  [(set (match_operand:VDQ_I 0 "register_operand" "=w,w,w")
> +        (xor:VDQ_I
> +           (and:VDQ_I
> +             (xor:VDQ_I
>                 (match_operand:<V_INT_EQUIV> 3 "register_operand" "w,0,w")
> -               (match_operand:VSDQ_I_DI 2 "register_operand" "w,w,0"))
> -             (match_operand:VSDQ_I_DI 1 "register_operand" "0,w,w"))
> +               (match_operand:VDQ_I 2 "register_operand" "w,w,0"))
> +             (match_operand:VDQ_I 1 "register_operand" "0,w,w"))
>            (match_dup:<V_INT_EQUIV> 3)
>          ))]
>    "TARGET_SIMD"
> @@ -2346,14 +2346,14 @@
>  ;; permutations of commutative operations, we have to have a separate pattern.
>
>  (define_insn "*aarch64_simd_bsl<mode>_alt"
> -  [(set (match_operand:VSDQ_I_DI 0 "register_operand" "=w,w,w")
> -        (xor:VSDQ_I_DI
> -           (and:VSDQ_I_DI
> -             (xor:VSDQ_I_DI
> -               (match_operand:VSDQ_I_DI 3 "register_operand" "w,w,0")
> -               (match_operand:VSDQ_I_DI 2 "register_operand" "w,0,w"))
> -             (match_operand:VSDQ_I_DI 1 "register_operand" "0,w,w"))
> -          (match_dup:VSDQ_I_DI 2)))]
> +  [(set (match_operand:VDQ_I 0 "register_operand" "=w,w,w")
> +        (xor:VDQ_I
> +           (and:VDQ_I
> +             (xor:VDQ_I
> +               (match_operand:VDQ_I 3 "register_operand" "w,w,0")
> +               (match_operand:<V_INT_EQUIV> 2 "register_operand" "w,0,w"))
> +             (match_operand:VDQ_I 1 "register_operand" "0,w,w"))
> +          (match_dup:<V_INT_EQUIV> 2)))]
>    "TARGET_SIMD"
>    "@
>    bsl\\t%0.<Vbtype>, %3.<Vbtype>, %2.<Vbtype>
> @@ -2362,6 +2362,100 @@
>    [(set_attr "type" "neon_bsl<q>")]
>  )
>
> +;; DImode is special, we want to avoid computing operations which are
> +;; more naturally computed in general purpose registers in the vector
> +;; registers.  If we do that, we need to move all three operands from general
> +;; purpose registers to vector registers, then back again.  However, we
> +;; don't want to make this pattern an UNSPEC as we'd lose scope for
> +;; optimizations based on the component operations of a BSL.
> +;;
> +;; That means we need a splitter back to the individual operations, if they
> +;; would be better calculated on the integer side.
> +
> +(define_insn_and_split "aarch64_simd_bsldi_internal"
> +  [(set (match_operand:DI 0 "register_operand" "=w,w,w,&r")
> +        (xor:DI
> +           (and:DI
> +             (xor:DI
> +               (match_operand:DI 3 "register_operand" "w,0,w,r")
> +               (match_operand:DI 2 "register_operand" "w,w,0,r"))
> +             (match_operand:DI 1 "register_operand" "0,w,w,r"))
> +          (match_dup:DI 3)
> +        ))]
> +  "TARGET_SIMD"
> +  "@
> +  bsl\\t%0.8b, %2.8b, %3.8b
> +  bit\\t%0.8b, %2.8b, %1.8b
> +  bif\\t%0.8b, %3.8b, %1.8b
> +  #"
> +  "&& GP_REGNUM_P (REGNO (operands[0]))"
> +  [(match_dup 1) (match_dup 1) (match_dup 2) (match_dup 3)]
> +{
> +  /* Split back to individual operations.  If we're before reload, and
> +     able to create a temporary register, do so.  If we're after reload,
> +     we've got an early-clobber destination register, so use that.
> +     Otherwise, we can't create pseudos and we can't yet guarantee that
> +     operands[0] is safe to write, so FAIL to split.  */
> +
> +  rtx scratch;
> +  if (reload_completed)
> +    scratch = operands[0];
> +  else if (can_create_pseudo_p ())
> +    scratch = gen_reg_rtx (DImode);
> +  else
> +    FAIL;
> +
> +  emit_insn (gen_xordi3 (scratch, operands[2], operands[3]));
> +  emit_insn (gen_anddi3 (scratch, scratch, operands[1]));
> +  emit_insn (gen_xordi3 (operands[0], scratch, operands[3]));
> +  DONE;
> +}
> +  [(set_attr "type" "neon_bsl,neon_bsl,neon_bsl,multiple")
> +   (set_attr "length" "4,4,4,12")]
> +)
> +
> +(define_insn_and_split "aarch64_simd_bsldi_alt"
> +  [(set (match_operand:DI 0 "register_operand" "=w,w,w,&r")
> +        (xor:DI
> +           (and:DI
> +             (xor:DI
> +               (match_operand:DI 3 "register_operand" "w,w,0,r")
> +               (match_operand:DI 2 "register_operand" "w,0,w,r"))
> +             (match_operand:DI 1 "register_operand" "0,w,w,r"))
> +          (match_dup:DI 2)
> +        ))]
> +  "TARGET_SIMD"
> +  "@
> +  bsl\\t%0.8b, %3.8b, %2.8b
> +  bit\\t%0.8b, %3.8b, %1.8b
> +  bif\\t%0.8b, %2.8b, %1.8b
> +  #"
> +  "&& GP_REGNUM_P (REGNO (operands[0]))"
> +  [(match_dup 0) (match_dup 1) (match_dup 2) (match_dup 3)]
> +{
> +  /* Split back to individual operations.  If we're before reload, and
> +     able to create a temporary register, do so.  If we're after reload,
> +     we've got an early-clobber destination register, so use that.
> +     Otherwise, we can't create pseudos and we can't yet guarantee that
> +     operands[0] is safe to write, so FAIL to split.  */
> +
> +  rtx scratch;
> +  if (reload_completed)
> +    scratch = operands[0];
> +  else if (can_create_pseudo_p ())
> +    scratch = gen_reg_rtx (DImode);
> +  else
> +    FAIL;
> +
> +  emit_insn (gen_xordi3 (scratch, operands[2], operands[3]));
> +  emit_insn (gen_anddi3 (scratch, scratch, operands[1]));
> +  emit_insn (gen_xordi3 (operands[0], scratch, operands[2]));
> +  DONE;
> +}
> +  [(set_attr "type" "neon_bsl,neon_bsl,neon_bsl,multiple")
> +   (set_attr "length" "4,4,4,12")]
> +)
> +
>  (define_expand "aarch64_simd_bsl<mode>"
>    [(match_operand:VALLDIF 0 "register_operand")
>     (match_operand:<V_INT_EQUIV> 1 "register_operand")
> diff --git a/gcc/testsuite/gcc.target/aarch64/bsl-idiom.c b/gcc/testsuite/gcc.target/aarch64/bsl-idiom.c
> new file mode 100644
> index 0000000..8151387
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/bsl-idiom.c
> @@ -0,0 +1,88 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fdump-rtl-combine --save-temps" } */
> +
> +/* Test that we don't generate BSL when in DImode with values in integer
> +   registers, and do generate it where we have values in floating-point
> +   registers.  This is useful, as it allows us to avoid register moves
> +   in the general case.
> +
> +   We want:
> +	eor	x0, x0, x1
> +	and	x0, x0, x2
> +	eor	x0, x0, x1
> +	ret
> +
> +   Rather than:
> +	fmov	d2, x0
> +	fmov	d0, x2
> +	fmov	d1, x1
> +	bsl	v0.8b, v2.8b, v1.8b
> +	fmov	x0, d0
> +	ret  */
> +
> +extern void abort (void);
> +
> +unsigned long long __attribute__ ((noinline))
> +foo (unsigned long long a, unsigned long long b, unsigned long long c)
> +{
> +  return ((a ^ b) & c) ^ b;
> +}
> +
> +unsigned long long __attribute__ ((noinline))
> +foo2 (unsigned long long a, unsigned long long b, unsigned long long c)
> +{
> +  return ((a ^ b) & c) ^ a;
> +}
> +
> +#define force_simd(V1)   asm volatile ("mov %d0, %1.d[0]"	\
> +	   : "=w"(V1)						\
> +	   : "w"(V1)						\
> +	   : /* No clobbers */);
> +
> +unsigned long long __attribute__ ((noinline))
> +bar (unsigned long long a, unsigned long long b, unsigned long long c)
> +{
> +  force_simd (a);
> +  force_simd (b);
> +  force_simd (c);
> +  c = ((a ^ b) & c) ^ b;
> +  force_simd (c);
> +  return c;
> +}
> +
> +unsigned long long __attribute__ ((noinline))
> +bar2 (unsigned long long a, unsigned long long b, unsigned long long c)
> +{
> +  force_simd (a);
> +  force_simd (b);
> +  force_simd (c);
> +  c = ((a ^ b) & c) ^ a;
> +  force_simd (c);
> +  return c;
> +}
> +
> +int
> +main (int argc, char** argv)
> +{
> +  unsigned long long a = 0x0123456789abcdefULL;
> +  unsigned long long b = 0xfedcba9876543210ULL;
> +  unsigned long long c = 0xaabbccddeeff7777ULL;
> +  if (foo (a, b, c) != bar (a, b, c))
> +    abort ();
> +  if (foo2 (a, b, c) != bar2 (a, b, c))
> +    abort ();
> +  return 0;
> +}
> +
> +/* 2 BSL, 6 FMOV (to floating-point registers), and 2 FMOV (to general
> +purpose registers) for the "bar" tests, which should still use BSL.  */
> +/* { dg-final { scan-assembler-times "bsl\tv\[0-9\]" 2 } } */
> +/* { dg-final { scan-assembler-times "fmov\td\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "fmov\tx\[0-9\]" 2 } } */
> +
> +/* { dg-final { scan-assembler-not "bif\tv\[0-9\]" } } */
> +/* { dg-final { scan-assembler-not "bit\tv\[0-9\]" } } */
> +
> +/* We always match the idiom during combine.  */
> +/* { dg-final { scan-rtl-dump-times "aarch64_simd_bsldi_internal" 2 "combine" } } */
> +/* { dg-final { scan-rtl-dump-times "aarch64_simd_bsldi_alt" 2 "combine" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/copysign-bsl.c b/gcc/testsuite/gcc.target/aarch64/copysign-bsl.c
> new file mode 100644
> index 0000000..4e63511
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/copysign-bsl.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +/* Test that we can generate DImode BSL when we are using
> +   copysign.  */
> +
> +double
> +foo (double a, double b)
> +{
> +  return __builtin_copysign (a, b);
> +}
> +
> +/* { dg-final { scan-assembler "bsl\tv\[0-9\]" } } */
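[A note on the copysign test: __builtin_copysign (a, b) is the same select
idiom with the mask fixed to the sign bit, taking bit 63 from b and bits
62:0 from a, which is why a DImode BSL is expected once the values live in
floating-point registers.  A minimal sketch of the bit-level identity,
assuming IEEE 754 binary64 layout (the helper name copysign_bsl is
illustrative, not something defined by the patch):

  #include <assert.h>
  #include <string.h>

  /* copysign (a, b) as a bit select: take only the sign bit from B and
     everything else from A.  With MASK covering just bit 63,
     ((ia ^ ib) & mask) ^ ia yields B's bit where the mask is set and
     A's bits elsewhere, the same DImode idiom the patterns above match.  */
  static double
  copysign_bsl (double a, double b)
  {
    unsigned long long ia, ib;
    const unsigned long long mask = 0x8000000000000000ULL;
    memcpy (&ia, &a, sizeof ia);
    memcpy (&ib, &b, sizeof ib);
    ia = ((ia ^ ib) & mask) ^ ia;
    memcpy (&a, &ia, sizeof a);
    return a;
  }

  int
  main (void)
  {
    assert (copysign_bsl (1.5, -2.0) == -1.5);
    assert (copysign_bsl (-1.5, 2.0) == 1.5);
    return 0;
  }

Since a and b already arrive in floating-point registers here, the select
maps straight onto BSL with no cross-register-file moves, unlike the
integer testcase at the top of the thread.]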