Re: [SLP/AArch64] Fix unpack handling for big-endian SVE

Richard Biener Fri, 26 Jan 2018 06:59:02 -0800

On Fri, Jan 26, 2018 at 3:21 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> I hadn't realised that on big-endian targets, VEC_UNPACK*HI_EXPR unpacks
> the low-numbered lanes and VEC_UNPACK*LO_EXPR unpacks the high-numbered
> lanes.  This meant that both the SVE patterns and the handling of
> fully-masked loops were wrong.
>
> The patch deals with that by making sure that all vec_unpack* optabs
> are define_expands, using BYTES_BIG_ENDIAN to choose the appropriate
> define_insn.  This in turn meant that we can get rid of the duplication
> between the signed and unsigned patterns for predicates.  (We provide
> implementations of both the signed and unsigned optabs because the sign
> doesn't matter for predicates: every element contains only one
> significant bit.)
>
> Also, the float unpacks need to unpack one half of the input vector,
> but the unpacked upper bits are "don't care".  There are two obvious
> ways of handling that: use an unpack (filling with zeros) or use a ZIP
> (filling with a duplicate of the low bits).  The code previously used
> unpacks, but the sequence involved a subreg that is semantically an
> element reverse on big-endian targets.  Using the ZIP patterns avoids
> that, and at the moment there's no reason to prefer one over the other
> for performance reasons, so the patch switches to ZIP unconditionally.
> As the comment says, it would be easy to optimise this later if UUNPK
> turns out to be better for some implementations.
>
> Tested on aarch64_be-elf, aarch64-linux-gnu and x86_64-linux-gnu.
> I think the tree-vect-loop-manip.c part is obvious, but OK for the
> AArch64 bits?
>
> Richard
>
>
> 2018-01-26  Richard Sandiford  <richard.sandif...@linaro.org>
>
> gcc/
>         * tree-vect-loop-manip.c (vect_maybe_permute_loop_masks):
>         Reverse the choice between VEC_UNPACK_LO_EXPR and VEC_UNPACK_HI_EXPR
>         for big-endian.
>         * config/aarch64/iterators.md (hi_lanes_optab): New int attribute.
>         * config/aarch64/aarch64-sve.md
>         (*aarch64_sve_<perm_insn><perm_hilo><mode>): Rename to...
>         (aarch64_sve_<perm_insn><perm_hilo><mode>): ...this.
>         (*extend<mode><Vwide>2): Rename to...
>         (aarch64_sve_extend<mode><Vwide>2): ...this.
>         (vec_unpack<su>_<perm_hilo>_<mode>): Turn into a define_expand,
>         renaming the old pattern to...
>         (aarch64_sve_punpk<perm_hilo>_<mode>): ...this.  Only define
>         unsigned packs.
>         (vec_unpack<su>_<perm_hilo>_<SVE_BHSI:mode>): Turn into a
>         define_expand, renaming the old pattern to...
>         (aarch64_sve_<su>unpk<perm_hilo>_<SVE_BHSI:mode>): ...this.
>         (*vec_unpacku_<perm_hilo>_<mode>_no_convert): Delete.
>         (vec_unpacks_<perm_hilo>_<mode>): Take BYTES_BIG_ENDIAN into
>         account when deciding which SVE instruction the optab should use.
>         (vec_unpack<su_optab>_float_<perm_hilo>_vnx4si): Likewise.
>
> gcc/testsuite/
>         * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Expect zips rather
>         than unpacks.
>         * gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
>         * gcc.target/aarch64/sve/unpack_float_1.c: Likewise.
>
> Index: gcc/tree-vect-loop-manip.c
> ===================================================================
> --- gcc/tree-vect-loop-manip.c  2018-01-13 18:02:00.948360274 +0000
> +++ gcc/tree-vect-loop-manip.c  2018-01-26 14:00:15.717916957 +0000
> @@ -334,7 +334,8 @@ vect_maybe_permute_loop_masks (gimple_se
>         {
>           tree src = src_rgm->masks[i / 2];
>           tree dest = dest_rgm->masks[i];
> -         tree_code code = (i & 1 ? VEC_UNPACK_HI_EXPR
> +         tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> +                           ? VEC_UNPACK_HI_EXPR
>                             : VEC_UNPACK_LO_EXPR);


This looks broken anyways -- the element oder on GIMPLE is the same
for little and big-endian.  I see two other BYTES_BIG_ENDIAN uses in
tree-vect-*, not sure when they crept in but they are all broken.

Richard.

>           gassign *stmt;
>           if (dest_masktype == unpack_masktype)
> Index: gcc/config/aarch64/iterators.md
> ===================================================================
> --- gcc/config/aarch64/iterators.md     2018-01-13 18:01:26.106685901 +0000
> +++ gcc/config/aarch64/iterators.md     2018-01-26 14:00:15.716916999 +0000
> @@ -1674,6 +1674,15 @@ (define_int_attr perm_hilo [(UNSPEC_ZIP1
>                             (UNSPEC_UNPACKSHI "hi") (UNSPEC_UNPACKUHI "hi")
>                             (UNSPEC_UNPACKSLO "lo") (UNSPEC_UNPACKULO "lo")])
>
> +;; Return true if the associated optab refers to the high-numbered lanes,
> +;; false if it refers to the low-numbered lanes.  The convention is for
> +;; "hi" to refer to the low-numbered lanes (the first ones in memory)
> +;; for big-endian.
> +(define_int_attr hi_lanes_optab [(UNSPEC_UNPACKSHI "!BYTES_BIG_ENDIAN")
> +                                (UNSPEC_UNPACKUHI "!BYTES_BIG_ENDIAN")
> +                                (UNSPEC_UNPACKSLO "BYTES_BIG_ENDIAN")
> +                                (UNSPEC_UNPACKULO "BYTES_BIG_ENDIAN")])
> +
>  (define_int_attr frecp_suffix  [(UNSPEC_FRECPE "e") (UNSPEC_FRECPX "x")])
>
>  (define_int_attr crc_variant [(UNSPEC_CRC32B "crc32b") (UNSPEC_CRC32H 
> "crc32h")
> Index: gcc/config/aarch64/aarch64-sve.md
> ===================================================================
> --- gcc/config/aarch64/aarch64-sve.md   2018-01-26 13:55:12.277219211 +0000
> +++ gcc/config/aarch64/aarch64-sve.md   2018-01-26 14:00:15.716916999 +0000
> @@ -817,7 +817,7 @@ (define_insn "*aarch64_sve_<perm_insn><p
>    "<perm_insn><perm_hilo>\t%0.<Vetype>, %1.<Vetype>, %2.<Vetype>"
>  )
>
> -(define_insn "*aarch64_sve_<perm_insn><perm_hilo><mode>"
> +(define_insn "aarch64_sve_<perm_insn><perm_hilo><mode>"
>    [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
>         (unspec:SVE_ALL [(match_operand:SVE_ALL 1 "register_operand" "w")
>                          (match_operand:SVE_ALL 2 "register_operand" "w")]
> @@ -2156,7 +2156,7 @@ (define_insn "*<optab><mode>vnx4sf2"
>  )
>
>  ;; Conversion of DI or SI to DF, predicated with a PTRUE.
> -(define_insn "*<optab><mode>vnx2df2"
> +(define_insn "aarch64_sve_<optab><mode>vnx2df2"
>    [(set (match_operand:VNx2DF 0 "register_operand" "=w")
>         (unspec:VNx2DF
>           [(match_operand:VNx2BI 1 "register_operand" "Upl")
> @@ -2183,7 +2183,7 @@ (define_insn "*trunc<Vwide><mode>2"
>
>  ;; Conversion of SFs to the same number of DFs, or HFs to the same number
>  ;; of SFs.
> -(define_insn "*extend<mode><Vwide>2"
> +(define_insn "aarch64_sve_extend<mode><Vwide>2"
>    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
>         (unspec:<VWIDE>
>           [(match_operand:<VWIDE_PRED> 1 "register_operand" "Upl")
> @@ -2195,17 +2195,50 @@ (define_insn "*extend<mode><Vwide>2"
>    "fcvt\t%0.<Vewtype>, %1/m, %2.<Vetype>"
>  )
>
> +;; Unpack the low or high half of a predicate, where "high" refers to
> +;; the low-numbered lanes for big-endian and the high-numbered lanes
> +;; for little-endian.
> +(define_expand "vec_unpack<su>_<perm_hilo>_<mode>"
> +  [(match_operand:<VWIDE> 0 "register_operand")
> +   (unspec:<VWIDE> [(match_operand:PRED_BHS 1 "register_operand")]
> +                  UNPACK)]
> +  "TARGET_SVE"
> +  {
> +    emit_insn ((<hi_lanes_optab>
> +               ? gen_aarch64_sve_punpkhi_<PRED_BHS:mode>
> +               : gen_aarch64_sve_punpklo_<PRED_BHS:mode>)
> +              (operands[0], operands[1]));
> +    DONE;
> +  }
> +)
> +
>  ;; PUNPKHI and PUNPKLO.
> -(define_insn "vec_unpack<su>_<perm_hilo>_<mode>"
> +(define_insn "aarch64_sve_punpk<perm_hilo>_<mode>"
>    [(set (match_operand:<VWIDE> 0 "register_operand" "=Upa")
>         (unspec:<VWIDE> [(match_operand:PRED_BHS 1 "register_operand" "Upa")]
> -                       UNPACK))]
> +                       UNPACK_UNSIGNED))]
>    "TARGET_SVE"
>    "punpk<perm_hilo>\t%0.h, %1.b"
>  )
>
> +;; Unpack the low or high half of a vector, where "high" refers to
> +;; the low-numbered lanes for big-endian and the high-numbered lanes
> +;; for little-endian.
> +(define_expand "vec_unpack<su>_<perm_hilo>_<SVE_BHSI:mode>"
> +  [(match_operand:<VWIDE> 0 "register_operand")
> +   (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand")] UNPACK)]
> +  "TARGET_SVE"
> +  {
> +    emit_insn ((<hi_lanes_optab>
> +               ? gen_aarch64_sve_<su>unpkhi_<SVE_BHSI:mode>
> +               : gen_aarch64_sve_<su>unpklo_<SVE_BHSI:mode>)
> +              (operands[0], operands[1]));
> +    DONE;
> +  }
> +)
> +
>  ;; SUNPKHI, UUNPKHI, SUNPKLO and UUNPKLO.
> -(define_insn "vec_unpack<su>_<perm_hilo>_<SVE_BHSI:mode>"
> +(define_insn "aarch64_sve_<su>unpk<perm_hilo>_<SVE_BHSI:mode>"
>    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
>         (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand" "w")]
>                         UNPACK))]
> @@ -2213,32 +2246,28 @@ (define_insn "vec_unpack<su>_<perm_hilo>
>    "<su>unpk<perm_hilo>\t%0.<Vewtype>, %1.<Vetype>"
>  )
>
> -;; Used by the vec_unpacks_<perm_hilo>_<mode> expander to unpack the bit
> -;; representation of a VNx4SF or VNx8HF without conversion.  The choice
> -;; between signed and unsigned isn't significant.
> -(define_insn "*vec_unpacku_<perm_hilo>_<mode>_no_convert"
> -  [(set (match_operand:SVE_HSF 0 "register_operand" "=w")
> -       (unspec:SVE_HSF [(match_operand:SVE_HSF 1 "register_operand" "w")]
> -                       UNPACK_UNSIGNED))]
> -  "TARGET_SVE"
> -  "uunpk<perm_hilo>\t%0.<Vewtype>, %1.<Vetype>"
> -)
> -
>  ;; Unpack one half of a VNx4SF to VNx2DF, or one half of a VNx8HF to VNx4SF.
>  ;; First unpack the source without conversion, then float-convert the
>  ;; unpacked source.
>  (define_expand "vec_unpacks_<perm_hilo>_<mode>"
> -  [(set (match_dup 2)
> -       (unspec:SVE_HSF [(match_operand:SVE_HSF 1 "register_operand")]
> -                       UNPACK_UNSIGNED))
> -   (set (match_operand:<VWIDE> 0 "register_operand")
> -       (unspec:<VWIDE> [(match_dup 3)
> -                        (unspec:<VWIDE> [(match_dup 2)] 
> UNSPEC_FLOAT_CONVERT)]
> -                       UNSPEC_MERGE_PTRUE))]
> -  "TARGET_SVE"
> -  {
> -    operands[2] = gen_reg_rtx (<MODE>mode);
> -    operands[3] = force_reg (<VWIDE_PRED>mode, CONSTM1_RTX 
> (<VWIDE_PRED>mode));
> +  [(match_operand:<VWIDE> 0 "register_operand")
> +   (unspec:SVE_HSF [(match_operand:SVE_HSF 1 "register_operand")]
> +                  UNPACK_UNSIGNED)]
> +  "TARGET_SVE"
> +  {
> +    /* Use ZIP to do the unpack, since we don't care about the upper halves
> +       and since it has the nice property of not needing any subregs.
> +       If using UUNPK* turns out to be preferable, we could model it as
> +       a ZIP whose first operand is zero.  */
> +    rtx temp = gen_reg_rtx (<MODE>mode);
> +    emit_insn ((<hi_lanes_optab>
> +               ? gen_aarch64_sve_zip2<mode>
> +               : gen_aarch64_sve_zip1<mode>)
> +               (temp, operands[1], operands[1]));
> +    rtx ptrue = force_reg (<VWIDE_PRED>mode, CONSTM1_RTX (<VWIDE_PRED>mode));
> +    emit_insn (gen_aarch64_sve_extend<mode><Vwide>2 (operands[0],
> +                                                    ptrue, temp));
> +    DONE;
>    }
>  )
>
> @@ -2246,18 +2275,25 @@ (define_expand "vec_unpacks_<perm_hilo>_
>  ;; to VNx2DI, reinterpret the VNx2DI as a VNx4SI, then convert the
>  ;; unpacked VNx4SI to VNx2DF.
>  (define_expand "vec_unpack<su_optab>_float_<perm_hilo>_vnx4si"
> -  [(set (match_dup 2)
> -       (unspec:VNx2DI [(match_operand:VNx4SI 1 "register_operand")]
> -                      UNPACK_UNSIGNED))
> -   (set (match_operand:VNx2DF 0 "register_operand")
> -       (unspec:VNx2DF [(match_dup 3)
> -                       (FLOATUORS:VNx2DF (match_dup 4))]
> -                      UNSPEC_MERGE_PTRUE))]
> -  "TARGET_SVE"
> -  {
> -    operands[2] = gen_reg_rtx (VNx2DImode);
> -    operands[3] = force_reg (VNx2BImode, CONSTM1_RTX (VNx2BImode));
> -    operands[4] = gen_rtx_SUBREG (VNx4SImode, operands[2], 0);
> +  [(match_operand:VNx2DF 0 "register_operand")
> +   (FLOATUORS:VNx2DF
> +     (unspec:VNx2DI [(match_operand:VNx4SI 1 "register_operand")]
> +                   UNPACK_UNSIGNED))]
> +  "TARGET_SVE"
> +  {
> +    /* Use ZIP to do the unpack, since we don't care about the upper halves
> +       and since it has the nice property of not needing any subregs.
> +       If using UUNPK* turns out to be preferable, we could model it as
> +       a ZIP whose first operand is zero.  */
> +    rtx temp = gen_reg_rtx (VNx4SImode);
> +    emit_insn ((<hi_lanes_optab>
> +               ? gen_aarch64_sve_zip2vnx4si
> +               : gen_aarch64_sve_zip1vnx4si)
> +              (temp, operands[1], operands[1]));
> +    rtx ptrue = force_reg (VNx2BImode, CONSTM1_RTX (VNx2BImode));
> +    emit_insn (gen_aarch64_sve_<FLOATUORS:optab>vnx4sivnx2df2 (operands[0],
> +                                                              ptrue, temp));
> +    DONE;
>    }
>  )
>
> Index: gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c 2018-01-13 
> 17:54:52.934269783 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c 2018-01-26 
> 14:00:15.716916999 +0000
> @@ -10,6 +10,6 @@ unpack_double_int_plus8 (double *d, int3
>      d[i] = s[i] + 8;
>  }
>
> -/* { dg-final { scan-assembler-times {\tuunpklo\tz[0-9]+\.d, z[0-9]+\.s\n} 1 
> } } */
> -/* { dg-final { scan-assembler-times {\tuunpkhi\tz[0-9]+\.d, z[0-9]+\.s\n} 1 
> } } */
> +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, 
> z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, 
> z[0-9]+\.s\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m, 
> z[0-9]+\.s\n} 2 } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c       
> 2018-01-13 17:54:52.934269783 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c       
> 2018-01-26 14:00:15.717916957 +0000
> @@ -10,6 +10,6 @@ unpack_double_int_plus9 (double *d, uint
>      d[i] = (double) (s[i] + 9);
>  }
>
> -/* { dg-final { scan-assembler-times {\tuunpklo\tz[0-9]+\.d, z[0-9]+\.s\n} 1 
> } } */
> -/* { dg-final { scan-assembler-times {\tuunpkhi\tz[0-9]+\.d, z[0-9]+\.s\n} 1 
> } } */
> +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, 
> z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, 
> z[0-9]+\.s\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m, 
> z[0-9]+\.s\n} 2 } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/unpack_float_1.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/unpack_float_1.c       2018-01-13 
> 17:54:52.935269745 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/unpack_float_1.c       2018-01-26 
> 14:00:15.717916957 +0000
> @@ -8,6 +8,6 @@ unpack_float_plus_7point9 (double *d, fl
>      d[i] = s[i] + 7.9;
>  }
>
> -/* { dg-final { scan-assembler-times {\tuunpklo\tz[0-9]+\.d, z[0-9]+\.s\n} 1 
> } } */
> -/* { dg-final { scan-assembler-times {\tuunpkhi\tz[0-9]+\.d, z[0-9]+\.s\n} 1 
> } } */
> +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, 
> z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, 
> z[0-9]+\.s\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfcvt\tz[0-9]+\.d, p[0-7]/m, 
> z[0-9]+\.s\n} 2 } } */

Re: [SLP/AArch64] Fix unpack handling for big-endian SVE

Reply via email to