On Wed, Jul 15, 2020 at 5:40 PM Dmitrij Pochepko
<dmitrij.poche...@bell-sw.com> wrote:
>
> Hi,
>
> here is an enhancement to GCC that allows load/store groups whose size is
> not a power of 2 to be vectorized.
> The current implementation uses interleaving permutations to transform
> load/store groups, which is where the power-of-2 requirement comes from.
> For N-element vectors, the simplest approach would be to use N single-element
> insertions for any required vector permutation.
> For 2-element vectors, that is a reasonable number of insertions.
> This approach allows vectorization of cases that were not supported
> before.
>
> bootstrapped and tested on x86_64-pc-linux-gnu and aarch64-linux-gnu.

I believe a more general fix revolves around making SLP discovery not
fail on the non-grouped load *k.  Quoting the testcase:

typedef struct {
    double m1, m2, m3, m4, m5;
} the_struct_t;

double bar1 (the_struct_t*);

double foo (double* k, unsigned int n, the_struct_t* the_struct)
{
    unsigned int u;
    the_struct_t result;
    for (u=0; u < n; u++, k--) {
       result.m1 += (*k)*the_struct[u].m1;
       result.m2 += (*k)*the_struct[u].m2;
       result.m3 += (*k)*the_struct[u].m3;
       result.m4 += (*k)*the_struct[u].m4;
    }
    return bar1 (&result);
}

Here *k could be accepted because it is the same in every
SLP lane.  Implementation-wise I think we'd handle a DR
group with a single element just fine here; we just have to be
careful (at this point) not to form such single-element groups
overeagerly.
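
To illustrate what "the same in every SLP lane" means, here is a
hand-written scalar model of the transformation.  It is only a sketch,
not what the vectorizer emits: the function names accumulate_scalar and
accumulate_slp_model are made up for this example, the accumulators are
zero-initialized (unlike the testcase), and the four lanes are modelled
with plain arrays rather than real vector registers.

```c
typedef struct {
    double m1, m2, m3, m4, m5;
} the_struct_t;

/* Scalar reference: the four accumulations from the testcase.
   k walks backwards through its array, one element per iteration. */
void accumulate_scalar (const double *k, unsigned int n,
                        const the_struct_t *s, double out[4])
{
    out[0] = out[1] = out[2] = out[3] = 0.0;
    for (unsigned int u = 0; u < n; u++, k--) {
        out[0] += (*k) * s[u].m1;
        out[1] += (*k) * s[u].m2;
        out[2] += (*k) * s[u].m3;
        out[3] += (*k) * s[u].m4;
    }
}

/* Model of the SLP form: one 4-lane group per iteration.
   The load *k is not part of a larger group, but since every lane
   uses the same value it can be treated as a splat (broadcast). */
void accumulate_slp_model (const double *k, unsigned int n,
                           const the_struct_t *s, double out[4])
{
    double acc[4] = { 0.0, 0.0, 0.0, 0.0 };
    for (unsigned int u = 0; u < n; u++, k--) {
        /* Uniform operand: the single element *k in all four lanes. */
        const double splat[4] = { *k, *k, *k, *k };
        /* Grouped load: four of the five struct members. */
        const double lanes[4] = { s[u].m1, s[u].m2, s[u].m3, s[u].m4 };
        for (int i = 0; i < 4; i++)
            acc[i] += splat[i] * lanes[i];
    }
    for (int i = 0; i < 4; i++)
        out[i] = acc[i];
}
```

Both functions perform the same multiplications and additions in the
same order per lane, so they should produce identical results; the
second form just makes the lane structure explicit.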

I've played with similar changes in this area already but some
refactoring could make things nicer.  So it's still on my TODO
to make the above SLP vectorized.

Richard.

> Thanks,
> Dmitrij
