Re: [PATCH 3/9][GCC][AArch64] Add autovectorization support for Complex instructions

2022-11-16 Thread Ramana Radhakrishnan via Gcc-patches


>
> OK with the comment typos fixed.
>
> > > > gcc/ChangeLog:
> > > >
> > > > 2018-11-11  Tamar Christina  
> > > >
> > > > * config/aarch64/aarch64-simd.md (aarch64_fcadd<rot><mode>,
> > > > fcadd<rot><mode>3, aarch64_fcmla<rot><mode>,
> > > > fcmla<rot><mode>4): New.
> > > > * config/aarch64/aarch64.h (TARGET_COMPLEX): New.
> > > > * config/aarch64/iterators.md (UNSPEC_FCADD90, UNSPEC_FCADD270,
> > > > UNSPEC_FCMLA, UNSPEC_FCMLA90, UNSPEC_FCMLA180, 
> > > > UNSPEC_FCMLA270): New.
> > > > (FCADD, FCMLA): New.
> > > > (rot, rotsplit1, rotsplit2): New.
> > > > * config/arm/types.md (neon_fcadd, neon_fcmla): New.
>
> Should we push these to an existing class for now, and split them later when
> someone provides a scheduling model which makes use of them?

Sorry about a blast from the past :P. Perhaps these should just have
been added to is_neon_type, which allows for the belts-and-braces
handling in cortex-a57.md, the basis for many a scheduler description
in the backend :)
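
For the record, a rough sketch of what that would look like (my
reconstruction, not a hunk from any posted patch; the existing
is_neon_type attribute in config/arm/types.md already enumerates a long
list of neon_* types, abbreviated here as "..."):

(define_attr "is_neon_type" "yes,no"
  (if_then_else (eq_attr "type"
          "..., neon_fcadd, neon_fcmla")
    (const_string "yes")
    (const_string "no")))

That would let the belts-and-braces handling in cortex-a57.md pick the
two new types up without a dedicated scheduling model.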

Ramana


Re: [PATCH 3/9][GCC][AArch64] Add autovectorization support for Complex instructions

2018-11-28 Thread James Greenhalgh
On Mon, Nov 12, 2018 at 06:31:45AM -0600, Tamar Christina wrote:
> Hi Kyrill,
> 
> > Hi Tamar,
> > 
> > On 11/11/18 10:26, Tamar Christina wrote:
> > > Hi All,
> > >
> > > This patch adds the expanders needed to support autovectorization of
> > > complex number operations, such as complex addition with a rotation
> > > along the Argand plane.  It also adds support for complex FMA.
> > >
> > > The instructions are described in the ArmARM [1] and are available from 
> > > Armv8.3-a onwards.
> > >
> > > Concretely, this generates
> > >
> > > f90:
> > > mov x3, 0
> > > .p2align 3,,7
> > > .L2:
> > > ldr q0, [x0, x3]
> > > ldr q1, [x1, x3]
> > > fcadd   v0.2d, v0.2d, v1.2d, #90
> > > str q0, [x2, x3]
> > > add x3, x3, 16
> > > cmp x3, 3200
> > > bne .L2
> > > ret
> > >
> > > now instead of
> > >
> > > f90:
> > > mov x4, x1
> > > mov x1, x2
> > > add x3, x4, 31
> > > add x2, x0, 31
> > > sub x3, x3, x1
> > > sub x2, x2, x1
> > > cmp x3, 62
> > > mov x3, 62
> > > ccmp x2, x3, 0, hi
> > > bls .L5
> > > mov x2, x4
> > > add x3, x0, 3200
> > > .p2align 3,,7
> > > .L3:
> > > ld2 {v2.2d - v3.2d}, [x0], 32
> > > ld2 {v4.2d - v5.2d}, [x2], 32
> > > cmp x0, x3
> > > fsub v0.2d, v2.2d, v5.2d
> > > fadd v1.2d, v4.2d, v3.2d
> > > st2 {v0.2d - v1.2d}, [x1], 32
> > > bne .L3
> > > ret
> > > .L5:
> > > add x6, x0, 8
> > > add x5, x4, 8
> > > add x2, x1, 8
> > > mov x3, 0
> > > .p2align 3,,7
> > > .L2:
> > > ldr d1, [x0, x3]
> > > ldr d3, [x5, x3]
> > > ldr d0, [x6, x3]
> > > ldr d2, [x4, x3]
> > > fsub d1, d1, d3
> > > fadd d0, d0, d2
> > > str d1, [x1, x3]
> > > str d0, [x2, x3]
> > > add x3, x3, 16
> > > cmp x3, 3200
> > > bne .L2
> > > ret
> > >
> > > For complex additions with a 90 degree rotation along the Argand plane.
> > >
> > > [1] 
> > > https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> > >
> > > Bootstrap and regtest on aarch64-none-linux-gnu, arm-none-gnueabihf and
> > > x86_64-pc-linux-gnu are still ongoing, but the previous patch showed no
> > > regressions.
> > >
> > > The instructions have also been tested on aarch64-none-elf and
> > > arm-none-eabi on an Armv8.3-a model with -march=Armv8.3-a+fp16, and all
> > > tests pass.
> > >
> > > Ok for trunk?

OK with the comment typos fixed.

> > > gcc/ChangeLog:
> > >
> > > 2018-11-11  Tamar Christina  
> > >
> > > * config/aarch64/aarch64-simd.md (aarch64_fcadd<rot><mode>,
> > > fcadd<rot><mode>3, aarch64_fcmla<rot><mode>,
> > > fcmla<rot><mode>4): New.
> > > * config/aarch64/aarch64.h (TARGET_COMPLEX): New.
> > > * config/aarch64/iterators.md (UNSPEC_FCADD90, UNSPEC_FCADD270,
> > > UNSPEC_FCMLA, UNSPEC_FCMLA90, UNSPEC_FCMLA180, UNSPEC_FCMLA270): 
> > > New.
> > > (FCADD, FCMLA): New.
> > > (rot, rotsplit1, rotsplit2): New.
> > > * config/arm/types.md (neon_fcadd, neon_fcmla): New.

Should we push these to an existing class for now, and split them later when
someone provides a scheduling model which makes use of them?

> > diff --git a/gcc/config/aarch64/aarch64-simd.md 
> > b/gcc/config/aarch64/aarch64-simd.md
> > index 
> > c4be3101fdec930707918106cd7c53cf7584553e..12a91183a98ea23015860c77a97955cb1b30bfbb
> >  100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -419,6 +419,63 @@
> >   }
> >   )
> >   
> > +;; The fcadd and fcmla patterns are made UNSPEC for the explicitly due to 
> > the

s/for the explicitly/explicitly

> > +;; fact that their usage need to guarantee that the source vectors are

s/need/needs

> > +;; contiguous.  It would be wrong to describe the operation without being 
> > able
> > +;; to describe the permute that is also required, but even if that is done
> > +;; the permute would have been created as a LOAD_LANES which means the 
> > values
> > +;; in the registers are in the wrong order.
> > +(define_insn "aarch64_fcadd<rot><mode>"
> > +  [(set (match_operand:VHSDF 0 "register_operand" "=w")
> > +   (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")
> > +  (match_operand:VHSDF 2 "register_operand" "w")]
> > +  FCADD))]
> > +  "TARGET_COMPLEX"
> > +  "fcadd\t%0., %1., %2., #"
> > +  [(set_attr "type" "neon_fcadd")]
> > +)



Re: [PATCH 3/9][GCC][AArch64] Add autovectorization support for Complex instructions

2018-11-12 Thread Tamar Christina
Hi Kyrill,

> Hi Tamar,
> 
> On 11/11/18 10:26, Tamar Christina wrote:
> > Hi All,
> >
> > This patch adds the expanders needed to support autovectorization of
> > complex number operations, such as complex addition with a rotation
> > along the Argand plane.  It also adds support for complex FMA.
> >
> > The instructions are described in the ArmARM [1] and are available from 
> > Armv8.3-a onwards.
> >
> > Concretely, this generates
> >
> > f90:
> > mov x3, 0
> > .p2align 3,,7
> > .L2:
> > ldr q0, [x0, x3]
> > ldr q1, [x1, x3]
> > fcadd   v0.2d, v0.2d, v1.2d, #90
> > str q0, [x2, x3]
> > add x3, x3, 16
> > cmp x3, 3200
> > bne .L2
> > ret
> >
> > now instead of
> >
> > f90:
> > mov x4, x1
> > mov x1, x2
> > add x3, x4, 31
> > add x2, x0, 31
> > sub x3, x3, x1
> > sub x2, x2, x1
> > cmp x3, 62
> > mov x3, 62
> > ccmp x2, x3, 0, hi
> > bls .L5
> > mov x2, x4
> > add x3, x0, 3200
> > .p2align 3,,7
> > .L3:
> > ld2 {v2.2d - v3.2d}, [x0], 32
> > ld2 {v4.2d - v5.2d}, [x2], 32
> > cmp x0, x3
> > fsub v0.2d, v2.2d, v5.2d
> > fadd v1.2d, v4.2d, v3.2d
> > st2 {v0.2d - v1.2d}, [x1], 32
> > bne .L3
> > ret
> > .L5:
> > add x6, x0, 8
> > add x5, x4, 8
> > add x2, x1, 8
> > mov x3, 0
> > .p2align 3,,7
> > .L2:
> > ldr d1, [x0, x3]
> > ldr d3, [x5, x3]
> > ldr d0, [x6, x3]
> > ldr d2, [x4, x3]
> > fsub d1, d1, d3
> > fadd d0, d0, d2
> > str d1, [x1, x3]
> > str d0, [x2, x3]
> > add x3, x3, 16
> > cmp x3, 3200
> > bne .L2
> > ret
> >
> > For complex additions with a 90 degree rotation along the Argand plane.
> >
> > [1] 
> > https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> >
> > Bootstrap and regtest on aarch64-none-linux-gnu, arm-none-gnueabihf and
> > x86_64-pc-linux-gnu are still ongoing, but the previous patch showed no
> > regressions.
> >
> > The instructions have also been tested on aarch64-none-elf and
> > arm-none-eabi on an Armv8.3-a model with -march=Armv8.3-a+fp16, and all
> > tests pass.
> >
> > Ok for trunk?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 2018-11-11  Tamar Christina  
> >
> > * config/aarch64/aarch64-simd.md (aarch64_fcadd<rot><mode>,
> > fcadd<rot><mode>3, aarch64_fcmla<rot><mode>,
> > fcmla<rot><mode>4): New.
> > * config/aarch64/aarch64.h (TARGET_COMPLEX): New.
> > * config/aarch64/iterators.md (UNSPEC_FCADD90, UNSPEC_FCADD270,
> > UNSPEC_FCMLA, UNSPEC_FCMLA90, UNSPEC_FCMLA180, UNSPEC_FCMLA270): 
> > New.
> > (FCADD, FCMLA): New.
> > (rot, rotsplit1, rotsplit2): New.
> > * config/arm/types.md (neon_fcadd, neon_fcmla): New.
> >
> > -- 
> 
> 
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> c4be3101fdec930707918106cd7c53cf7584553e..12a91183a98ea23015860c77a97955cb1b30bfbb
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -419,6 +419,63 @@
>   }
>   )
>   
> +;; The fcadd and fcmla patterns are made UNSPEC for the explicitly due to the
> +;; fact that their usage need to guarantee that the source vectors are
> +;; contiguous.  It would be wrong to describe the operation without being 
> able
> +;; to describe the permute that is also required, but even if that is done
> +;; the permute would have been created as a LOAD_LANES which means the values
> +;; in the registers are in the wrong order.
> +(define_insn "aarch64_fcadd<rot><mode>"
> +  [(set (match_operand:VHSDF 0 "register_operand" "=w")
> + (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")
> +(match_operand:VHSDF 2 "register_operand" "w")]
> +FCADD))]
> +  "TARGET_COMPLEX"
> +  "fcadd\t%0., %1., %2., #"
> +  [(set_attr "type" "neon_fcadd")]
> +)
> +
> +(define_expand "fcadd<rot><mode>3"
> +  [(set (match_operand:VHSDF 0 "register_operand")
> + (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")
> +(match_operand:VHSDF 2 "register_operand")]
> +FCADD))]
> +  "TARGET_COMPLEX"
> +{
> +  emit_insn (gen_aarch64_fcadd<rot><mode> (operands[0], operands[1],
> +operands[2]));
> +  DONE;
> +})
> +
> +(define_insn "aarch64_fcmla<rot><mode>"
> +  [(set (match_operand:VHSDF 0 "register_operand" "=w")
> + (plus:VHSDF (match_operand:VHSDF 1 "register_operand" "0")
> + (unspec:VHSDF [(match_operand:VHSDF 2 "register_operand" 
> "w")
> +

Re: [PATCH 3/9][GCC][AArch64] Add autovectorization support for Complex instructions

2018-11-12 Thread Kyrill Tkachov

Hi Tamar,

On 11/11/18 10:26, Tamar Christina wrote:

Hi All,

This patch adds the expanders needed to support autovectorization of
complex number operations, such as complex addition with a rotation along
the Argand plane.  It also adds support for complex FMA.

The instructions are described in the ArmARM [1] and are available from 
Armv8.3-a onwards.

Concretely, this generates

f90:
mov x3, 0
.p2align 3,,7
.L2:
ldr q0, [x0, x3]
ldr q1, [x1, x3]
fcadd   v0.2d, v0.2d, v1.2d, #90
str q0, [x2, x3]
add x3, x3, 16
cmp x3, 3200
bne .L2
ret

now instead of

f90:
mov x4, x1
mov x1, x2
add x3, x4, 31
add x2, x0, 31
sub x3, x3, x1
sub x2, x2, x1
cmp x3, 62
mov x3, 62
ccmp x2, x3, 0, hi
bls .L5
mov x2, x4
add x3, x0, 3200
.p2align 3,,7
.L3:
ld2 {v2.2d - v3.2d}, [x0], 32
ld2 {v4.2d - v5.2d}, [x2], 32
cmp x0, x3
fsub v0.2d, v2.2d, v5.2d
fadd v1.2d, v4.2d, v3.2d
st2 {v0.2d - v1.2d}, [x1], 32
bne .L3
ret
.L5:
add x6, x0, 8
add x5, x4, 8
add x2, x1, 8
mov x3, 0
.p2align 3,,7
.L2:
ldr d1, [x0, x3]
ldr d3, [x5, x3]
ldr d0, [x6, x3]
ldr d2, [x4, x3]
fsub d1, d1, d3
fadd d0, d0, d2
str d1, [x1, x3]
str d0, [x2, x3]
add x3, x3, 16
cmp x3, 3200
bne .L2
ret

For complex additions with a 90 degree rotation along the Argand plane.

[1] 
https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile

Bootstrap and regtest on aarch64-none-linux-gnu, arm-none-gnueabihf and
x86_64-pc-linux-gnu are still ongoing, but the previous patch showed no
regressions.

The instructions have also been tested on aarch64-none-elf and arm-none-eabi
on an Armv8.3-a model with -march=Armv8.3-a+fp16, and all tests pass.

Ok for trunk?

Thanks,
Tamar

gcc/ChangeLog:

2018-11-11  Tamar Christina  

* config/aarch64/aarch64-simd.md (aarch64_fcadd<rot><mode>,
fcadd<rot><mode>3, aarch64_fcmla<rot><mode>,
fcmla<rot><mode>4): New.
* config/aarch64/aarch64.h (TARGET_COMPLEX): New.
* config/aarch64/iterators.md (UNSPEC_FCADD90, UNSPEC_FCADD270,
UNSPEC_FCMLA, UNSPEC_FCMLA90, UNSPEC_FCMLA180, UNSPEC_FCMLA270): New.
(FCADD, FCMLA): New.
(rot, rotsplit1, rotsplit2): New.
* config/arm/types.md (neon_fcadd, neon_fcmla): New.

--



diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
c4be3101fdec930707918106cd7c53cf7584553e..12a91183a98ea23015860c77a97955cb1b30bfbb
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -419,6 +419,63 @@
 }
 )
 
+;; The fcadd and fcmla patterns are made UNSPEC for the explicitly due to the

+;; fact that their usage need to guarantee that the source vectors are
+;; contiguous.  It would be wrong to describe the operation without being able
+;; to describe the permute that is also required, but even if that is done
+;; the permute would have been created as a LOAD_LANES which means the values
+;; in the registers are in the wrong order.
+(define_insn "aarch64_fcadd<rot><mode>"
+  [(set (match_operand:VHSDF 0 "register_operand" "=w")
+   (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")
+  (match_operand:VHSDF 2 "register_operand" "w")]
+  FCADD))]
+  "TARGET_COMPLEX"
+  "fcadd\t%0., %1., %2., #"
+  [(set_attr "type" "neon_fcadd")]
+)
+
+(define_expand "fcadd<rot><mode>3"
+  [(set (match_operand:VHSDF 0 "register_operand")
+   (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")
+  (match_operand:VHSDF 2 "register_operand")]
+  FCADD))]
+  "TARGET_COMPLEX"
+{
+  emit_insn (gen_aarch64_fcadd<rot><mode> (operands[0], operands[1],
+  operands[2]));
+  DONE;
+})
+
+(define_insn "aarch64_fcmla<rot><mode>"
+  [(set (match_operand:VHSDF 0 "register_operand" "=w")
+   (plus:VHSDF (match_operand:VHSDF 1 "register_operand" "0")
+   (unspec:VHSDF [(match_operand:VHSDF 2 "register_operand" 
"w")
+  (match_operand:VHSDF 3 "register_operand" 
"w")]
+  FCMLA)))]
+  "TARGET_COMPLEX"
+  "fcmla\t%0., %2., %3., #"
+  [(set_attr "type" "neon_fcmla")]
+)
+
+;; The complex mla operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "fcmla<rot><mode>4"
+  [(set (match_operand:VHSDF 0 "register_operand")
+   (plus:VHSDF (match_operand:VHSDF 1 

[PATCH 3/9][GCC][AArch64] Add autovectorization support for Complex instructions

2018-11-11 Thread Tamar Christina
Hi All,

This patch adds the expanders needed to support autovectorization of
complex number operations, such as complex addition with a rotation along
the Argand plane.  It also adds support for complex FMA.

The instructions are described in the ArmARM [1] and are available from 
Armv8.3-a onwards.
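
The kind of source loop being targeted looks roughly like the following
sketch (hand-written for illustration, not lifted from the testsuite; the
function name f90 and the 200-element/3200-byte trip count are chosen to
line up with the generated code quoted below):

#include <complex.h>

#define N 200   /* 200 complex doubles == 3200 bytes, as in the loops below.  */

/* Complex addition with the second operand rotated by 90 degrees along
   the Argand plane: c[i] = a[i] + b[i] * I.  */
void
f90 (_Complex double *a, _Complex double *b, _Complex double *c)
{
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i] * I;
}

/* Complex FMA, the fcmla case: acc[i] += a[i] * b[i].  */
void
fma_c (_Complex double *a, _Complex double *b, _Complex double *acc)
{
  for (int i = 0; i < N; i++)
    acc[i] += a[i] * b[i];
}

The FMA case is also why the fcmla patterns come in pairs: a full complex
multiply-accumulate needs an fcmla with rotation #0 followed by one with
#90 (or the #180/#270 forms for the conjugated variants), which is what
the "always need to expand to two instructions" comment further down
refers to.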

Concretely, this generates

f90:
mov x3, 0
.p2align 3,,7
.L2:
ldr q0, [x0, x3]
ldr q1, [x1, x3]
fcadd   v0.2d, v0.2d, v1.2d, #90
str q0, [x2, x3]
add x3, x3, 16
cmp x3, 3200
bne .L2
ret

now instead of

f90:
mov x4, x1
mov x1, x2
add x3, x4, 31
add x2, x0, 31
sub x3, x3, x1
sub x2, x2, x1
cmp x3, 62
mov x3, 62
ccmp x2, x3, 0, hi
bls .L5
mov x2, x4
add x3, x0, 3200
.p2align 3,,7
.L3:
ld2 {v2.2d - v3.2d}, [x0], 32
ld2 {v4.2d - v5.2d}, [x2], 32
cmp x0, x3
fsub v0.2d, v2.2d, v5.2d
fadd v1.2d, v4.2d, v3.2d
st2 {v0.2d - v1.2d}, [x1], 32
bne .L3
ret
.L5:
add x6, x0, 8
add x5, x4, 8
add x2, x1, 8
mov x3, 0
.p2align 3,,7
.L2:
ldr d1, [x0, x3]
ldr d3, [x5, x3]
ldr d0, [x6, x3]
ldr d2, [x4, x3]
fsub d1, d1, d3
fadd d0, d0, d2
str d1, [x1, x3]
str d0, [x2, x3]
add x3, x3, 16
cmp x3, 3200
bne .L2
ret

For complex additions with a 90 degree rotation along the Argand plane.
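
Element-wise (per complex lane, following the ArmARM description in [1]),
the two FCADD rotations compute:

  /* fcadd ..., #90  : d = a + b*I  */
  d_re = a_re - b_im;
  d_im = a_im + b_re;

  /* fcadd ..., #270 : d = a - b*I  */
  d_re = a_re + b_im;
  d_im = a_im - b_re;

which is exactly the fsub/fadd pair visible in the ld2/st2 loop and the
scalar fallback loop above.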

[1] 
https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile

Bootstrap and regtest on aarch64-none-linux-gnu, arm-none-gnueabihf and
x86_64-pc-linux-gnu are still ongoing, but the previous patch showed no
regressions.

The instructions have also been tested on aarch64-none-elf and arm-none-eabi
on an Armv8.3-a model with -march=Armv8.3-a+fp16, and all tests pass.

Ok for trunk?

Thanks,
Tamar

gcc/ChangeLog:

2018-11-11  Tamar Christina  

* config/aarch64/aarch64-simd.md (aarch64_fcadd<rot><mode>,
fcadd<rot><mode>3, aarch64_fcmla<rot><mode>,
fcmla<rot><mode>4): New.
* config/aarch64/aarch64.h (TARGET_COMPLEX): New.
* config/aarch64/iterators.md (UNSPEC_FCADD90, UNSPEC_FCADD270,
UNSPEC_FCMLA, UNSPEC_FCMLA90, UNSPEC_FCMLA180, UNSPEC_FCMLA270): New.
(FCADD, FCMLA): New.
(rot, rotsplit1, rotsplit2): New.
* config/arm/types.md (neon_fcadd, neon_fcmla): New.

-- 
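
(For context, and not one of the hunks below: the iterators.md additions
listed in the ChangeLog would look roughly like this reconstruction, with
the rot attribute supplying the #<rot> immediate printed by the fcadd and
fcmla alternatives.)

(define_int_iterator FCADD [UNSPEC_FCADD90 UNSPEC_FCADD270])

(define_int_iterator FCMLA [UNSPEC_FCMLA UNSPEC_FCMLA90
                            UNSPEC_FCMLA180 UNSPEC_FCMLA270])

(define_int_attr rot [(UNSPEC_FCADD90 "90") (UNSPEC_FCADD270 "270")
                      (UNSPEC_FCMLA "0") (UNSPEC_FCMLA90 "90")
                      (UNSPEC_FCMLA180 "180") (UNSPEC_FCMLA270 "270")])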
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index c4be3101fdec930707918106cd7c53cf7584553e..12a91183a98ea23015860c77a97955cb1b30bfbb 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -419,6 +419,63 @@
 }
 )
 
+;; The fcadd and fcmla patterns are made UNSPEC for the explicitly due to the
+;; fact that their usage need to guarantee that the source vectors are
+;; contiguous.  It would be wrong to describe the operation without being able
+;; to describe the permute that is also required, but even if that is done
+;; the permute would have been created as a LOAD_LANES which means the values
+;; in the registers are in the wrong order.
+(define_insn "aarch64_fcadd<rot><mode>"
+  [(set (match_operand:VHSDF 0 "register_operand" "=w")
+	(unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")
+		   (match_operand:VHSDF 2 "register_operand" "w")]
+		   FCADD))]
+  "TARGET_COMPLEX"
+  "fcadd\t%0., %1., %2., #"
+  [(set_attr "type" "neon_fcadd")]
+)
+
+(define_expand "fcadd<rot><mode>3"
+  [(set (match_operand:VHSDF 0 "register_operand")
+	(unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")
+		   (match_operand:VHSDF 2 "register_operand")]
+		   FCADD))]
+  "TARGET_COMPLEX"
+{
+  emit_insn (gen_aarch64_fcadd<rot><mode> (operands[0], operands[1],
+	   operands[2]));
+  DONE;
+})
+
+(define_insn "aarch64_fcmla<rot><mode>"
+  [(set (match_operand:VHSDF 0 "register_operand" "=w")
+	(plus:VHSDF (match_operand:VHSDF 1 "register_operand" "0")
+		(unspec:VHSDF [(match_operand:VHSDF 2 "register_operand" "w")
+   (match_operand:VHSDF 3 "register_operand" "w")]
+   FCMLA)))]
+  "TARGET_COMPLEX"
+  "fcmla\t%0., %2., %3., #"
+  [(set_attr "type" "neon_fcmla")]
+)
+
+;; The complex mla operations always need to expand to two instructions.
+;; The first operation does half the computation and the second does the
+;; remainder.  Because of this, expand early.
+(define_expand "fcmla<rot><mode>4"
+  [(set (match_operand:VHSDF 0 "register_operand")
+	(plus:VHSDF (match_operand:VHSDF 1 "register_operand")
+		(unspec:VHSDF [(match_operand:VHSDF 2 "register_operand")
+   (match_operand:VHSDF 3 "register_operand")]
+   FCMLA)))]
+  "TARGET_COMPLEX"
+{
+  emit_insn (gen_aarch64_fcmla<rot><mode> (operands[0], operands[1],
+		 operands[2],