For the case below, the code generated by “gcc -O3” is very ugly:

char g_d[1024], g_s1[1024], g_s2[1024];

void test_loop(void)
{
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;

    for( int y = 0; y < 128; y++ )
    {
        for( int x = 0; x < 16; x++ )
            d[x] = s1[x] + s2[x];
        d += 16;
    }
}

If we change “for( int x = 0; x < 16; x++ )” to “for( int x = 0; x < 32; x++ )”, nicely vectorized code is generated:

test_loop:
.LFB0:
        .cfi_startproc
        adrp    x2, g_s1
        adrp    x3, g_s2
        add     x2, x2, :lo12:g_s1
        add     x3, x3, :lo12:g_s2
        adrp    x0, g_d
        adrp    x1, g_d+2048
        add     x0, x0, :lo12:g_d
        add     x1, x1, :lo12:g_d+2048
        ldp     q1, q2, [x2]
        ldp     q3, q0, [x3]
        add     v1.16b, v1.16b, v3.16b
        add     v0.16b, v0.16b, v2.16b
        .p2align 3,,7
.L2:
        str     q1, [x0]
        str     q0, [x0, 16]!
        cmp     x0, x1
        bne     .L2
        ret
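
In this output the loads and the two vector adds are hoisted out of the loop, because s1 and s2 never advance, so only the two stores remain in the loop body. A rough C equivalent is sketched below. The function names, the macros, and the DST_SIZE sizing are assumptions made for illustration only; in particular the destination here is sized so that every row store stays in bounds, which differs from the 1024-byte arrays in the test case.

```c
#include <string.h>

#define ROWS  128   /* outer trip count, as in the test case */
#define STEP  16    /* d advances by 16 bytes per row, as in the test case */
#define WIDTH 32    /* inner trip count of the variant that vectorizes well */

/* Assumption of this sketch: large enough for the last 32-byte row store. */
#define DST_SIZE ((ROWS - 1) * STEP + WIDTH)

/* Straightforward form: recompute the sums for every row. */
void sum_rows_naive(char *d, const char *s1, const char *s2)
{
    for (int y = 0; y < ROWS; y++) {
        for (int x = 0; x < WIDTH; x++)
            d[x] = (char)(s1[x] + s2[x]);
        d += STEP;
    }
}

/* Hoisted form: s1 and s2 are loop-invariant, so the sums are computed
 * once and each row only stores them. This mirrors the two adds before
 * .L2 and the two stores inside it. It assumes d does not overlap s1/s2. */
void sum_rows_hoisted(char *d, const char *s1, const char *s2)
{
    char sum[WIDTH];

    for (int x = 0; x < WIDTH; x++)
        sum[x] = (char)(s1[x] + s2[x]);

    for (int y = 0; y < ROWS; y++) {
        memcpy(d, sum, WIDTH);  /* two 16-byte stores in the assembly */
        d += STEP;
    }
}
```

Both functions should leave the same bytes in the destination as long as it does not overlap the sources, so one would hope for the same hoisting with trip counts of 16 and 8.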

The code generated for "for( int x = 0; x < 8; x++ )" is also very ugly.
It looks like gcc has potential bugs in loop vectorization. Any ideas?

Thanks,
-Jiangning