[Bug tree-optimization/80570] auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract

2021-09-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570

--- Comment #3 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #2)
> Even on aarch64:
> 
> .L2:
> ldr q0, [x1], 16
> sxtl    v1.2d, v0.2s
> sxtl2   v0.2d, v0.4s
> scvtf   v1.2d, v1.2d
> scvtf   v0.2d, v0.2d
> stp q1, q0, [x0]
>
> But the above is decent really.

More than decent, that's what we *should* be doing, I think.

AArch64 has versions of most widening instructions that read the top half of a
vector, unlike x86-64, where VPMOVZX / SX can only read from the bottom half.
That's the key difference, and it's what makes this strategy good on AArch64
but bad on x86-64.

(On 32-bit ARM, you load a q register, then read the two halves separately as
64-bit d<0..31> registers.  AArch64 changed that: there are 32x 128-bit vector
regs, and no partial regs aliasing the high half.  Instead, it provides OP /
OP2 versions of some instructions that widen (and the like), with the "2"
version reading the high half.  Presumably part of the motivation was to make
it easier to port ARM NEON code that depended on accessing halves of a 128-bit
q vector via its d regs, but it's a generally reasonable design, and could also
have been motivated by seeing how inconvenient things get in SSE and AVX for
pmovsx/zx.)

Anyway, AArch64 SIMD is specifically designed to make it fully efficient to do
wide loads and then unpack both halves, as is possible on 32-bit ARM but not on
x86-64.
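
For concreteness, here's a sketch (mine, not from the PR's testcase) of the
half-width-memory-operand strategy the bug title asks for on x86-64, written
with AVX intrinsics; the function name and the assumption that n is a multiple
of 8 are mine:

#include <immintrin.h>
#include <stddef.h>

/* Convert 8 ints to 8 doubles per iteration using two half-width (128-bit)
   loads feeding vcvtdq2pd, instead of one 256-bit load plus a cross-lane
   extract.  Hypothetical name; assumes AVX and n % 8 == 0. */
void int_to_double_avx(double *dp, const int *ip, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i lo = _mm_loadu_si128((const __m128i *)(ip + i));
        __m128i hi = _mm_loadu_si128((const __m128i *)(ip + i + 4));
        /* compilers can fold each load into a vcvtdq2pd ymm, m128 operand */
        _mm256_storeu_pd(dp + i,     _mm256_cvtepi32_pd(lo));
        _mm256_storeu_pd(dp + i + 4, _mm256_cvtepi32_pd(hi));
    }
}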

It's also using a store (of a pair of regs) that's twice the width of the load.
 Even if it were using a max-width load of a pair of 128-bit vectors (and
having to store two pairs), that would still be good, just effectively
unrolling.  But GCC sees it as one load and two separate stores that it just
happens to be able to combine into a pair.
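
The same widen-both-halves pattern can be written with ACLE NEON intrinsics,
mapping one-to-one onto the sxtl / sxtl2 / scvtf sequence above (my sketch;
the function name and the n % 4 == 0 assumption are mine):

#include <arm_neon.h>
#include <stddef.h>

void int_to_double_neon(double *dp, const int32_t *ip, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        int32x4_t v  = vld1q_s32(ip + i);            /* ldr   q0, [x1], 16 */
        int64x2_t lo = vmovl_s32(vget_low_s32(v));   /* sxtl  v1.2d, v0.2s */
        int64x2_t hi = vmovl_high_s32(v);            /* sxtl2 v0.2d, v0.4s */
        vst1q_f64(dp + i,     vcvtq_f64_s64(lo));    /* scvtf + store      */
        vst1q_f64(dp + i + 2, vcvtq_f64_s64(hi));    /* scvtf + store      */
    }
}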

[Bug tree-optimization/80570] auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract

2021-09-26 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570

Andrew Pinski  changed:

           What|Removed                     |Added
----------------------------------------------------------------------------
      Component|target                      |tree-optimization

--- Comment #2 from Andrew Pinski  ---
  vect__4.5_24 = MEM  [(int *)ip_12 + ivtmp.15_28 * 1];
  vect_tmp_14.6_23 = [vec_unpack_float_lo_expr] vect__4.5_24;
  vect_tmp_14.6_22 = [vec_unpack_float_hi_expr] vect__4.5_24;
  MEM  [(double *)dp_10 + ivtmp.15_28 * 2] = vect_tmp_14.6_23;
  MEM  [(double *)dp_10 + 32B + ivtmp.15_28 * 2] = vect_tmp_14.6_22;
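
For context, this is roughly the scalar source the dump above corresponds to
(my reconstruction, not necessarily the PR's exact testcase):

#include <stddef.h>

void cvt(double *dp, const int *ip, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dp[i] = ip[i];          /* int -> double conversion */
}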

Even on aarch64:

.L2:
ldr q0, [x1], 16
sxtl    v1.2d, v0.2s
sxtl2   v0.2d, v0.4s
scvtf   v1.2d, v1.2d
scvtf   v0.2d, v0.2d
stp q1, q0, [x0]
add x0, x0, 32
cmp x2, x1
bne .L2

But the above is decent really.