https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122347
Bug ID: 122347
Summary: Reuse memory access via loop tiling when vectorizing inner loop
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---
In a loop nest, a memory access in the inner loop may be invariant with respect
to the outer loop. When the inner loop is vectorized, reusing such an access
across outer-loop iterations via loop tiling or a similar transform is not
supported in some situations. This matters especially when a backend (for
example with -mcpu=neoverse-n2) suggests an unroll factor greater than 1: in
that case it is better to consider tiling than to unroll the inner loop further
on top of vectorization. For example:
void foo(int *__restrict__ output, char *a, char *b, int n)
{
  for (int i = 0; i < 2048; ++i)
    {
      int offset = i * n;
      int sum = 0;
      for (int j = 0; j < 2048; ++j)
        sum += a[offset + j] * b[j];
      output[i] = sum;
    }
}
On neoverse-n2, we get:
void foo(int *__restrict__ output, char *a, char *b, int n)
{
  // outer loop: not unrolled
  for (int i = 0; i < 2048; ++i)
    {
      int offset = i * n;
      vector(4) int v_sum0 = { 0 };
      vector(4) int v_sum1 = { 0 };
      // inner loop: suggested unroll factor = 2
      for (int j = 0; j < 2048; j += 16 * 2)
        {
          vector(16) char v_a0 = VECTOR_LOAD (&a[offset + j + 16 * 0]);
          vector(16) char v_a1 = VECTOR_LOAD (&a[offset + j + 16 * 1]);
          vector(16) char v_b0 = VECTOR_LOAD (&b[j + 16 * 0]);
          vector(16) char v_b1 = VECTOR_LOAD (&b[j + 16 * 1]);
          v_sum0 += DOT_PROD (v_a0, v_b0); // v_b0 and v_b1 are different
          v_sum1 += DOT_PROD (v_a1, v_b1);
        }
      output[i] = REDUC_PLUS (v_sum0 + v_sum1);
    }
}
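To make the redundancy in the listing above concrete, the vector loads can be
counted by hand. The sketch below is only illustrative: loads_per_row and
struct load_count are hypothetical names, and the count follows the pseudocode
as written rather than the instructions GCC actually emits.

```c
#include <assert.h>

/* Vector loads issued for one outer-loop iteration (one row of "a").  */
struct load_count
{
  int a_loads;  /* loads from a[offset + j ...], different for every row */
  int b_loads;  /* loads from b[j ...], identical for every row */
};

/* trip: inner trip count; vf: elements per vector; unroll: inner unroll
   factor.  Hypothetical helper, not part of GCC.  */
struct load_count
loads_per_row (int trip, int vf, int unroll)
{
  assert (vf > 0 && unroll > 0 && trip % (vf * unroll) == 0);
  int iters = trip / (vf * unroll);
  /* Each unrolled copy of the body loads one vector of a and one of b.  */
  struct load_count c = { iters * unroll, iters * unroll };
  return c;
}
```

With the values from the listing, loads_per_row (2048, 16, 2) gives 128 loads
from a and 128 loads from b per row. The 128 loads from b fetch the same data
on each of the 2048 outer iterations.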
Since "b[j]" does not change across iterations of the outer loop, its loads
could be reused via loop tiling, a transform similar to unrolling the outer
loop. The generated code would then look like:
void foo(int *__restrict__ output, char *a, char *b, int n)
{
  // outer loop: unrolled by 2
  for (int i = 0; i < 2048; i += 2)
    {
      int offset0 = (i + 0) * n;
      int offset1 = (i + 1) * n;
      vector(4) int v_sum0 = { 0 };
      vector(4) int v_sum1 = { 0 };
      // inner loop: suggested unroll factor = 1
      for (int j = 0; j < 2048; j += 16)
        {
          vector(16) char v_a0 = VECTOR_LOAD (&a[offset0 + j]);
          vector(16) char v_a1 = VECTOR_LOAD (&a[offset1 + j]);
          vector(16) char v_b = VECTOR_LOAD (&b[j]);
          v_sum0 += DOT_PROD (v_a0, v_b); // reuse b[j]
          v_sum1 += DOT_PROD (v_a1, v_b); // reuse b[j]
        }
      output[i + 0] = REDUC_PLUS (v_sum0);
      output[i + 1] = REDUC_PLUS (v_sum1);
    }
}
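A scalar sketch of the same transform may help when checking that it preserves
results. This is not what the vectorizer emits; it only mirrors the shape of
the listing above (outer loop unrolled by 2, the two inner loops fused, which
is often called unroll-and-jam). The names foo_ref, foo_tiled and
tiling_matches_reference are hypothetical, and the self-check assumes a small
row stride n so that the test buffer stays small.

```c
#include <assert.h>

/* Reference: the loop nest from the report, unchanged.  */
void
foo_ref (int *__restrict__ output, char *a, char *b, int n)
{
  for (int i = 0; i < 2048; ++i)
    {
      int offset = i * n;
      int sum = 0;
      for (int j = 0; j < 2048; ++j)
        sum += a[offset + j] * b[j];
      output[i] = sum;
    }
}

/* Outer loop unrolled by 2, with both copies of the inner loop fused so
   that each b[j] is read once and used for two rows of a.  */
void
foo_tiled (int *__restrict__ output, char *a, char *b, int n)
{
  for (int i = 0; i < 2048; i += 2)
    {
      int offset0 = (i + 0) * n;
      int offset1 = (i + 1) * n;
      int sum0 = 0, sum1 = 0;
      for (int j = 0; j < 2048; ++j)
        {
          int bj = b[j];  /* loaded once, reused for both rows */
          sum0 += a[offset0 + j] * bj;
          sum1 += a[offset1 + j] * bj;
        }
      output[i + 0] = sum0;
      output[i + 1] = sum1;
    }
}

/* Hypothetical self-check: returns 1 if both versions agree for row
   stride n.  The buffer is sized for 0 <= n <= 4 (rows then overlap,
   which is harmless because a is only read).  */
int
tiling_matches_reference (int n)
{
  static char a[2047 * 4 + 2048];
  static char b[2048];
  static int out_ref[2048], out_tiled[2048];

  assert (n >= 0 && n <= 4);
  for (int k = 0; k < (int) sizeof a; ++k)
    a[k] = (char) (k % 13);
  for (int k = 0; k < 2048; ++k)
    b[k] = (char) (k % 7);

  foo_ref (out_ref, a, b, n);
  foo_tiled (out_tiled, a, b, n);
  for (int i = 0; i < 2048; ++i)
    if (out_ref[i] != out_tiled[i])
      return 0;
  return 1;
}
```

No epilogue is needed here because the outer trip count (2048) is even; an odd
trip count would require a remainder iteration. In the vector form, the tiled
inner loop issues three vector loads per two DOT_PRODs, where the version
unrolled along j issues four.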