https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492
--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to ptomsich from comment #6)
> With the current master, the test case generates (with -mcpu=neoverse-n1):
> which contrasts with LLVM13 (with -mcpu=neoverse-n1):
>
> test_slp: // @test_slp
> .cfi_startproc
> // %bb.0: // %entry
> ldr q0, [x0]
> movi v1.16b, #1
> movi v2.2d, #0000000000000000
> udot v2.4s, v0.16b, v1.16b
> addv s0, v2.4s
> fmov w0, s0
> ret
> .Lfunc_end0:
> .size test_slp, .Lfunc_end0-test_slp
>
> or (LLVM13 w/o the mcpu-option):
>
> .type test_slp,@function
> test_slp: // @test_slp
> .cfi_startproc
> // %bb.0: // %entry
> ldr q0, [x0]
> ushll2 v1.8h, v0.16b, #0
> ushll v0.8h, v0.8b, #0
> uaddl2 v2.4s, v0.8h, v1.8h
> uaddl v0.4s, v0.4h, v1.4h
> add v0.4s, v0.4s, v2.4s
> addv s0, v0.4s
> fmov w0, s0
> ret
> .Lfunc_end0:
> .size test_slp, .Lfunc_end0-test_slp
It's definitely a neat trick, but correct me if I'm wrong: it's only possible
because addition is commutative.
Clang has just simply reordered the loads because the loop is very simple to
just
for( int i = 0; i < 4; i++, b += 4 )
{
tmp[i][0] = b[0];
tmp[i][1] = b[1];
tmp[i][2] = b[2];
tmp[i][3] = b[3];
}
Which GCC also handles fine.
As Richi mentioned before
>I know the "real" code this testcase is from has actual operations
> in place of the b[N] reads, for the above vectorization looks somewhat
> pointless given we end up decomposing the result again.
It seems a bit of a too narrow focus to optimize for this particular example as
the real code does "other" things.
i.e.
Both GCC and Clang fall apart with
int test_slp( unsigned char *b )
{
unsigned int tmp[4][4];
int sum = 0;
for( int i = 0; i < 4; i++, b += 4 )
{
tmp[i][0] = b[0] - b[4];
tmp[i][2] = b[1] + b[5];
tmp[i][1] = b[2] - b[6];
tmp[i][3] = b[3] + b[7];
}
for( int i = 0; i < 4; i++ )
{
sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
}
return sum;
}
which has about the same access pattern as the real code.
If you change the operations you'll notice that for others examples like
int test_slp( unsigned char *b )
{
unsigned int tmp[4][4];
int sum = 0;
for( int i = 0; i < 4; i++, b += 4 )
{
tmp[i][0] = b[0] - b[4];
tmp[i][2] = b[1] - b[5];
tmp[i][1] = b[2] - b[6];
tmp[i][3] = b[3] - b[7];
}
for( int i = 0; i < 4; i++ )
{
sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
}
return sum;
}
GCC handles this better (but we are let down by register allocation).
To me it seems quite unlikely that actual code would be written like that, but
I guess there could be a case to be made to try to reassoc loads as well.