On Fri, Jan 26, 2018 at 3:21 PM, Richard Sandiford <richard.sandif...@linaro.org> wrote: > I hadn't realised that on big-endian targets, VEC_UNPACK*HI_EXPR unpacks > the low-numbered lanes and VEC_UNPACK*LO_EXPR unpacks the high-numbered > lanes. This meant that both the SVE patterns and the handling of > fully-masked loops were wrong. > > The patch deals with that by making sure that all vec_unpack* optabs > are define_expands, using BYTES_BIG_ENDIAN to choose the appropriate > define_insn. This in turn meant that we can get rid of the duplication > between the signed and unsigned patterns for predicates. (We provide > implementations of both the signed and unsigned optabs because the sign > doesn't matter for predicates: every element contains only one > significant bit.) > > Also, the float unpacks need to unpack one half of the input vector, > but the unpacked upper bits are "don't care". There are two obvious > ways of handling that: use an unpack (filling with zeros) or use a ZIP > (filling with a duplicate of the low bits). The code previously used > unpacks, but the sequence involved a subreg that is semantically an > element reverse on big-endian targets. Using the ZIP patterns avoids > that, and at the moment there's no reason to prefer one over the other > for performance reasons, so the patch switches to ZIP unconditionally. > As the comment says, it would be easy to optimise this later if UUNPK > turns out to be better for some implementations. > > Tested on aarch64_be-elf, aarch64-linux-gnu and x86_64-linux-gnu. > I think the tree-vect-loop-manip.c part is obvious, but OK for the > AArch64 bits? > > Richard > > > 2018-01-26 Richard Sandiford <richard.sandif...@linaro.org> > > gcc/ > * tree-vect-loop-manip.c (vect_maybe_permute_loop_masks): > Reverse the choice between VEC_UNPACK_LO_EXPR and VEC_UNPACK_HI_EXPR > for big-endian. > * config/aarch64/iterators.md (hi_lanes_optab): New int attribute. > * config/aarch64/aarch64-sve.md > (*aarch64_sve_<perm_insn><perm_hilo><mode>): Rename to... > (aarch64_sve_<perm_insn><perm_hilo><mode>): ...this. > (*extend<mode><Vwide>2): Rename to... > (aarch64_sve_extend<mode><Vwide>2): ...this. > (vec_unpack<su>_<perm_hilo>_<mode>): Turn into a define_expand, > renaming the old pattern to... > (aarch64_sve_punpk<perm_hilo>_<mode>): ...this. Only define > unsigned packs. > (vec_unpack<su>_<perm_hilo>_<SVE_BHSI:mode>): Turn into a > define_expand, renaming the old pattern to... > (aarch64_sve_<su>unpk<perm_hilo>_<SVE_BHSI:mode>): ...this. > (*vec_unpacku_<perm_hilo>_<mode>_no_convert): Delete. > (vec_unpacks_<perm_hilo>_<mode>): Take BYTES_BIG_ENDIAN into > account when deciding which SVE instruction the optab should use. > (vec_unpack<su_optab>_float_<perm_hilo>_vnx4si): Likewise. > > gcc/testsuite/ > * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Expect zips rather > than unpacks. > * gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise. > * gcc.target/aarch64/sve/unpack_float_1.c: Likewise. > > Index: gcc/tree-vect-loop-manip.c > =================================================================== > --- gcc/tree-vect-loop-manip.c 2018-01-13 18:02:00.948360274 +0000 > +++ gcc/tree-vect-loop-manip.c 2018-01-26 14:00:15.717916957 +0000 > @@ -334,7 +334,8 @@ vect_maybe_permute_loop_masks (gimple_se > { > tree src = src_rgm->masks[i / 2]; > tree dest = dest_rgm->masks[i]; > - tree_code code = (i & 1 ? VEC_UNPACK_HI_EXPR > + tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1) > + ? VEC_UNPACK_HI_EXPR > : VEC_UNPACK_LO_EXPR);
This looks broken anyways -- the element oder on GIMPLE is the same for little and big-endian. I see two other BYTES_BIG_ENDIAN uses in tree-vect-*, not sure when they crept in but they are all broken. Richard. > gassign *stmt; > if (dest_masktype == unpack_masktype) > Index: gcc/config/aarch64/iterators.md > =================================================================== > --- gcc/config/aarch64/iterators.md 2018-01-13 18:01:26.106685901 +0000 > +++ gcc/config/aarch64/iterators.md 2018-01-26 14:00:15.716916999 +0000 > @@ -1674,6 +1674,15 @@ (define_int_attr perm_hilo [(UNSPEC_ZIP1 > (UNSPEC_UNPACKSHI "hi") (UNSPEC_UNPACKUHI "hi") > (UNSPEC_UNPACKSLO "lo") (UNSPEC_UNPACKULO "lo")]) > > +;; Return true if the associated optab refers to the high-numbered lanes, > +;; false if it refers to the low-numbered lanes. The convention is for > +;; "hi" to refer to the low-numbered lanes (the first ones in memory) > +;; for big-endian. > +(define_int_attr hi_lanes_optab [(UNSPEC_UNPACKSHI "!BYTES_BIG_ENDIAN") > + (UNSPEC_UNPACKUHI "!BYTES_BIG_ENDIAN") > + (UNSPEC_UNPACKSLO "BYTES_BIG_ENDIAN") > + (UNSPEC_UNPACKULO "BYTES_BIG_ENDIAN")]) > + > (define_int_attr frecp_suffix [(UNSPEC_FRECPE "e") (UNSPEC_FRECPX "x")]) > > (define_int_attr crc_variant [(UNSPEC_CRC32B "crc32b") (UNSPEC_CRC32H > "crc32h") > Index: gcc/config/aarch64/aarch64-sve.md > =================================================================== > --- gcc/config/aarch64/aarch64-sve.md 2018-01-26 13:55:12.277219211 +0000 > +++ gcc/config/aarch64/aarch64-sve.md 2018-01-26 14:00:15.716916999 +0000 > @@ -817,7 +817,7 @@ (define_insn "*aarch64_sve_<perm_insn><p > "<perm_insn><perm_hilo>\t%0.<Vetype>, %1.<Vetype>, %2.<Vetype>" > ) > > -(define_insn "*aarch64_sve_<perm_insn><perm_hilo><mode>" > +(define_insn "aarch64_sve_<perm_insn><perm_hilo><mode>" > [(set (match_operand:SVE_ALL 0 "register_operand" "=w") > (unspec:SVE_ALL [(match_operand:SVE_ALL 1 "register_operand" "w") > (match_operand:SVE_ALL 2 "register_operand" "w")] > @@ -2156,7 +2156,7 @@ (define_insn "*<optab><mode>vnx4sf2" > ) > > ;; Conversion of DI or SI to DF, predicated with a PTRUE. > -(define_insn "*<optab><mode>vnx2df2" > +(define_insn "aarch64_sve_<optab><mode>vnx2df2" > [(set (match_operand:VNx2DF 0 "register_operand" "=w") > (unspec:VNx2DF > [(match_operand:VNx2BI 1 "register_operand" "Upl") > @@ -2183,7 +2183,7 @@ (define_insn "*trunc<Vwide><mode>2" > > ;; Conversion of SFs to the same number of DFs, or HFs to the same number > ;; of SFs. > -(define_insn "*extend<mode><Vwide>2" > +(define_insn "aarch64_sve_extend<mode><Vwide>2" > [(set (match_operand:<VWIDE> 0 "register_operand" "=w") > (unspec:<VWIDE> > [(match_operand:<VWIDE_PRED> 1 "register_operand" "Upl") > @@ -2195,17 +2195,50 @@ (define_insn "*extend<mode><Vwide>2" > "fcvt\t%0.<Vewtype>, %1/m, %2.<Vetype>" > ) > > +;; Unpack the low or high half of a predicate, where "high" refers to > +;; the low-numbered lanes for big-endian and the high-numbered lanes > +;; for little-endian. > +(define_expand "vec_unpack<su>_<perm_hilo>_<mode>" > + [(match_operand:<VWIDE> 0 "register_operand") > + (unspec:<VWIDE> [(match_operand:PRED_BHS 1 "register_operand")] > + UNPACK)] > + "TARGET_SVE" > + { > + emit_insn ((<hi_lanes_optab> > + ? gen_aarch64_sve_punpkhi_<PRED_BHS:mode> > + : gen_aarch64_sve_punpklo_<PRED_BHS:mode>) > + (operands[0], operands[1])); > + DONE; > + } > +) > + > ;; PUNPKHI and PUNPKLO. > -(define_insn "vec_unpack<su>_<perm_hilo>_<mode>" > +(define_insn "aarch64_sve_punpk<perm_hilo>_<mode>" > [(set (match_operand:<VWIDE> 0 "register_operand" "=Upa") > (unspec:<VWIDE> [(match_operand:PRED_BHS 1 "register_operand" "Upa")] > - UNPACK))] > + UNPACK_UNSIGNED))] > "TARGET_SVE" > "punpk<perm_hilo>\t%0.h, %1.b" > ) > > +;; Unpack the low or high half of a vector, where "high" refers to > +;; the low-numbered lanes for big-endian and the high-numbered lanes > +;; for little-endian. > +(define_expand "vec_unpack<su>_<perm_hilo>_<SVE_BHSI:mode>" > + [(match_operand:<VWIDE> 0 "register_operand") > + (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand")] UNPACK)] > + "TARGET_SVE" > + { > + emit_insn ((<hi_lanes_optab> > + ? gen_aarch64_sve_<su>unpkhi_<SVE_BHSI:mode> > + : gen_aarch64_sve_<su>unpklo_<SVE_BHSI:mode>) > + (operands[0], operands[1])); > + DONE; > + } > +) > + > ;; SUNPKHI, UUNPKHI, SUNPKLO and UUNPKLO. > -(define_insn "vec_unpack<su>_<perm_hilo>_<SVE_BHSI:mode>" > +(define_insn "aarch64_sve_<su>unpk<perm_hilo>_<SVE_BHSI:mode>" > [(set (match_operand:<VWIDE> 0 "register_operand" "=w") > (unspec:<VWIDE> [(match_operand:SVE_BHSI 1 "register_operand" "w")] > UNPACK))] > @@ -2213,32 +2246,28 @@ (define_insn "vec_unpack<su>_<perm_hilo> > "<su>unpk<perm_hilo>\t%0.<Vewtype>, %1.<Vetype>" > ) > > -;; Used by the vec_unpacks_<perm_hilo>_<mode> expander to unpack the bit > -;; representation of a VNx4SF or VNx8HF without conversion. The choice > -;; between signed and unsigned isn't significant. > -(define_insn "*vec_unpacku_<perm_hilo>_<mode>_no_convert" > - [(set (match_operand:SVE_HSF 0 "register_operand" "=w") > - (unspec:SVE_HSF [(match_operand:SVE_HSF 1 "register_operand" "w")] > - UNPACK_UNSIGNED))] > - "TARGET_SVE" > - "uunpk<perm_hilo>\t%0.<Vewtype>, %1.<Vetype>" > -) > - > ;; Unpack one half of a VNx4SF to VNx2DF, or one half of a VNx8HF to VNx4SF. > ;; First unpack the source without conversion, then float-convert the > ;; unpacked source. > (define_expand "vec_unpacks_<perm_hilo>_<mode>" > - [(set (match_dup 2) > - (unspec:SVE_HSF [(match_operand:SVE_HSF 1 "register_operand")] > - UNPACK_UNSIGNED)) > - (set (match_operand:<VWIDE> 0 "register_operand") > - (unspec:<VWIDE> [(match_dup 3) > - (unspec:<VWIDE> [(match_dup 2)] > UNSPEC_FLOAT_CONVERT)] > - UNSPEC_MERGE_PTRUE))] > - "TARGET_SVE" > - { > - operands[2] = gen_reg_rtx (<MODE>mode); > - operands[3] = force_reg (<VWIDE_PRED>mode, CONSTM1_RTX > (<VWIDE_PRED>mode)); > + [(match_operand:<VWIDE> 0 "register_operand") > + (unspec:SVE_HSF [(match_operand:SVE_HSF 1 "register_operand")] > + UNPACK_UNSIGNED)] > + "TARGET_SVE" > + { > + /* Use ZIP to do the unpack, since we don't care about the upper halves > + and since it has the nice property of not needing any subregs. > + If using UUNPK* turns out to be preferable, we could model it as > + a ZIP whose first operand is zero. */ > + rtx temp = gen_reg_rtx (<MODE>mode); > + emit_insn ((<hi_lanes_optab> > + ? gen_aarch64_sve_zip2<mode> > + : gen_aarch64_sve_zip1<mode>) > + (temp, operands[1], operands[1])); > + rtx ptrue = force_reg (<VWIDE_PRED>mode, CONSTM1_RTX (<VWIDE_PRED>mode)); > + emit_insn (gen_aarch64_sve_extend<mode><Vwide>2 (operands[0], > + ptrue, temp)); > + DONE; > } > ) > > @@ -2246,18 +2275,25 @@ (define_expand "vec_unpacks_<perm_hilo>_ > ;; to VNx2DI, reinterpret the VNx2DI as a VNx4SI, then convert the > ;; unpacked VNx4SI to VNx2DF. > (define_expand "vec_unpack<su_optab>_float_<perm_hilo>_vnx4si" > - [(set (match_dup 2) > - (unspec:VNx2DI [(match_operand:VNx4SI 1 "register_operand")] > - UNPACK_UNSIGNED)) > - (set (match_operand:VNx2DF 0 "register_operand") > - (unspec:VNx2DF [(match_dup 3) > - (FLOATUORS:VNx2DF (match_dup 4))] > - UNSPEC_MERGE_PTRUE))] > - "TARGET_SVE" > - { > - operands[2] = gen_reg_rtx (VNx2DImode); > - operands[3] = force_reg (VNx2BImode, CONSTM1_RTX (VNx2BImode)); > - operands[4] = gen_rtx_SUBREG (VNx4SImode, operands[2], 0); > + [(match_operand:VNx2DF 0 "register_operand") > + (FLOATUORS:VNx2DF > + (unspec:VNx2DI [(match_operand:VNx4SI 1 "register_operand")] > + UNPACK_UNSIGNED))] > + "TARGET_SVE" > + { > + /* Use ZIP to do the unpack, since we don't care about the upper halves > + and since it has the nice property of not needing any subregs. > + If using UUNPK* turns out to be preferable, we could model it as > + a ZIP whose first operand is zero. */ > + rtx temp = gen_reg_rtx (VNx4SImode); > + emit_insn ((<hi_lanes_optab> > + ? gen_aarch64_sve_zip2vnx4si > + : gen_aarch64_sve_zip1vnx4si) > + (temp, operands[1], operands[1])); > + rtx ptrue = force_reg (VNx2BImode, CONSTM1_RTX (VNx2BImode)); > + emit_insn (gen_aarch64_sve_<FLOATUORS:optab>vnx4sivnx2df2 (operands[0], > + ptrue, temp)); > + DONE; > } > ) > > Index: gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c > =================================================================== > --- gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c 2018-01-13 > 17:54:52.934269783 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c 2018-01-26 > 14:00:15.716916999 +0000 > @@ -10,6 +10,6 @@ unpack_double_int_plus8 (double *d, int3 > d[i] = s[i] + 8; > } > > -/* { dg-final { scan-assembler-times {\tuunpklo\tz[0-9]+\.d, z[0-9]+\.s\n} 1 > } } */ > -/* { dg-final { scan-assembler-times {\tuunpkhi\tz[0-9]+\.d, z[0-9]+\.s\n} 1 > } } */ > +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, > z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, > z[0-9]+\.s\n} 1 } } */ > /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m, > z[0-9]+\.s\n} 2 } } */ > Index: gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c > =================================================================== > --- gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c > 2018-01-13 17:54:52.934269783 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c > 2018-01-26 14:00:15.717916957 +0000 > @@ -10,6 +10,6 @@ unpack_double_int_plus9 (double *d, uint > d[i] = (double) (s[i] + 9); > } > > -/* { dg-final { scan-assembler-times {\tuunpklo\tz[0-9]+\.d, z[0-9]+\.s\n} 1 > } } */ > -/* { dg-final { scan-assembler-times {\tuunpkhi\tz[0-9]+\.d, z[0-9]+\.s\n} 1 > } } */ > +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, > z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, > z[0-9]+\.s\n} 1 } } */ > /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m, > z[0-9]+\.s\n} 2 } } */ > Index: gcc/testsuite/gcc.target/aarch64/sve/unpack_float_1.c > =================================================================== > --- gcc/testsuite/gcc.target/aarch64/sve/unpack_float_1.c 2018-01-13 > 17:54:52.935269745 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve/unpack_float_1.c 2018-01-26 > 14:00:15.717916957 +0000 > @@ -8,6 +8,6 @@ unpack_float_plus_7point9 (double *d, fl > d[i] = s[i] + 7.9; > } > > -/* { dg-final { scan-assembler-times {\tuunpklo\tz[0-9]+\.d, z[0-9]+\.s\n} 1 > } } */ > -/* { dg-final { scan-assembler-times {\tuunpkhi\tz[0-9]+\.d, z[0-9]+\.s\n} 1 > } } */ > +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, > z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, > z[0-9]+\.s\n} 1 } } */ > /* { dg-final { scan-assembler-times {\tfcvt\tz[0-9]+\.d, p[0-7]/m, > z[0-9]+\.s\n} 2 } } */