https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125303
Bug ID: 125303
Summary: vector operation before shuffle produces unvectorized shuffle
Product: gcc
Version: 16.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: lists at coryfields dot com
Target Milestone: ---
I'm not sure exactly what precondition breaks the vectorized shuffle, but
the following reproduces the issue by doing an xor before the shuffle:
typedef unsigned vec256 __attribute__((__vector_size__(32)));

void vec_xor(vec256& x)
{
    x ^= 1;
}

void vec_shuf(vec256& x)
{
    x = (vec256){x[4], x[0], x[5], x[1], x[6], x[2], x[7], x[3]};
}

void vec_xor_shuf(vec256& x)
{
    x ^= 1;
    x = (vec256){x[4], x[0], x[5], x[1], x[6], x[2], x[7], x[3]};
}
Godbolt link: https://godbolt.org/z/TEKWjx8jf
vec_xor and vec_shuf each compile to the expected vector code.
But vec_xor_shuf, which just combines the two, breaks down into a mess of
loads and non-vectorized operations.
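For reference, a minimal sketch of a correctness harness around the three functions above. It only confirms that the fused function is semantically the same as calling vec_xor and then vec_shuf, so any difference is purely in code quality. The helper name fused_matches_split and the input values are hypothetical, chosen for illustration.

```cpp
#include <cassert>

typedef unsigned vec256 __attribute__((__vector_size__(32)));

// The three functions from the report, unchanged apart from layout.
void vec_xor(vec256& x) { x ^= 1; }

void vec_shuf(vec256& x)
{
    x = (vec256){x[4], x[0], x[5], x[1], x[6], x[2], x[7], x[3]};
}

void vec_xor_shuf(vec256& x)
{
    x ^= 1;
    x = (vec256){x[4], x[0], x[5], x[1], x[6], x[2], x[7], x[3]};
}

// Hypothetical helper: returns true when the fused function produces the
// same lanes as applying vec_xor followed by vec_shuf.
bool fused_matches_split()
{
    vec256 a = {10, 11, 12, 13, 14, 15, 16, 17};  // arbitrary test input
    vec256 b = a;

    vec_xor_shuf(a);  // fused path (the one that compiles badly)

    vec_xor(b);       // split path (each step compiles well on its own)
    vec_shuf(b);

    for (int i = 0; i < 8; ++i)
        if (a[i] != b[i])
            return false;
    return true;
}
```

A caller could run this as assert(fused_matches_split()); from main.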
aarch64 is perhaps the worst offender.
On aarch64, clang produces:
vec_xor_shuf(unsigned int vector[8]&):
        movi    v0.4s, #1
        ldp     q1, q2, [x0]
        eor     v4.16b, v1.16b, v0.16b
        eor     v3.16b, v2.16b, v0.16b
        st2     { v3.4s, v4.4s }, [x0]
        ret
While gcc16 produces:
vec_xor_shuf(unsigned int __vector(8)&):
        ldp     q30, q31, [x0]
        mov     x3, 0
        movi    v29.4s, 0x1
        mov     x1, 0
        mov     x2, 0
        eor     v30.16b, v30.16b, v29.16b
        eor     v29.16b, v31.16b, v29.16b
        movi    v31.2d, #0
        dup     s28, v29.s[1]
        ins     v31.s[0], v29.s[0]
        fmov    x4, d28
        dup     s28, v30.s[1]
        ins     v31.s[1], v30.s[0]
        bfi     x3, x4, 0, 32
        fmov    x4, d28
        dup     s28, v29.s[2]
        dup     s29, v29.s[3]
        bfi     x3, x4, 32, 32
        fmov    x4, d28
        dup     s28, v30.s[2]
        dup     s30, v30.s[3]
        bfi     x1, x4, 0, 32
        fmov    x4, d28
        bfi     x1, x4, 32, 32
        fmov    x4, d29
        bfi     x2, x4, 0, 32
        fmov    x4, d30
        bfi     x2, x4, 32, 32
        fmov    x4, d31
        stp     x1, x2, [x0, 16]
        stp     x4, x3, [x0]
        ret
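To make explicit why clang's single st2 is enough: the shuffle is just an interleave of the high and low 128-bit halves. Below is a rough scalar model of that, assuming st2 { a.4s, b.4s } stores a[0], b[0], a[1], b[1], and so on. The names st2_model and st2_matches_shuffle are made up for this sketch and are not part of any API.

```cpp
#include <cassert>

typedef unsigned vec256 __attribute__((__vector_size__(32)));

// Scalar model (for illustration only) of `st2 { a.4s, b.4s }, [out]`:
// the two registers are stored element-interleaved.
void st2_model(const unsigned a[4], const unsigned b[4], unsigned out[8])
{
    for (int i = 0; i < 4; ++i) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

// The shuffle from the report.
void vec_shuf(vec256& x)
{
    x = (vec256){x[4], x[0], x[5], x[1], x[6], x[2], x[7], x[3]};
}

// Hypothetical check: interleaving (high half, low half) reproduces the shuffle.
bool st2_matches_shuffle()
{
    vec256 x = {0, 1, 2, 3, 4, 5, 6, 7};
    unsigned lo[4], hi[4], out[8];
    for (int i = 0; i < 4; ++i) {
        lo[i] = x[i];      // what ldp puts in q1
        hi[i] = x[i + 4];  // what ldp puts in q2
    }

    st2_model(hi, lo, out);  // high half first, as in clang's st2 { v3, v4 }

    vec_shuf(x);
    for (int i = 0; i < 8; ++i)
        if (out[i] != x[i])
            return false;
    return true;
}
```

The xor is lane-wise, so applying it to each half before the interleave gives the same result as applying it to the whole vector, which matches the two eor instructions in the clang output.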
x86_64 with -mavx fares poorly as well.
This causes a vectorized impl of chacha20 to be _MUCH_ slower with gcc than
clang.