https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123414
--- Comment #5 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Heh, it's more insidious than that and it's indeed a middle-end issue and not a
target one.
For emulated reduction we use a scheme like
  /* Case 2: Create:
       for (offset = nelements/2; offset >= 1; offset/=2)
         {
           Create: va' = vec_shift <va, offset>
           Create: va = vop <va, va'>
         } */
where the shifts are done through permutes.
In order to keep the neutral value 1, the result of this reduction is inserted
in a vector like
{res, 1, 1, 1, ...}.
We optimize this vector constructor in forwprop and the permute looks like
VEC_PERM_EXPR <res_vec, {1, 1, 1, ...}, {0, 256, 257, ..., 511}>.
Now with zvl256b and LMUL8 a char vector has 256 elements, but we use unsigned
char as the permute mask element type.
Thus, we build a tree for the mask op as {0, 1, 2, 3, ..., 255} (because 256
overflows the mask element type).
LMUL once again testing the limits of vectors :)
I'm testing a patch. I think the issue is here:
      mask_type
	= build_vector_type (build_nonstandard_integer_type (elem_size, 1),
			     refnelts);
where refnelts = 256 instead of 512. For a permute we need to be able to use
indices up to refnelts * 2.