https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122347

            Bug ID: 122347
           Summary: Reuse memory access via loop tiling when vectorizing
                    inner loop
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

In a loop nest, some memory accesses in the inner loop may be invariant with
respect to the outer loop. In some situations the vectorizer does not reuse
such accesses, via loop tiling or a similar transform, when it vectorizes the
inner loop. In particular, when a backend (such as -mcpu=neoverse-n2) suggests
an unroll factor greater than 1, it would be better to consider tiling instead
of unrolling the inner loop on top of vectorization. For example:

void foo(int *__restrict__ output, char *a, char *b, int n)
{
  for (int i = 0; i < 2048; ++i)
    {
      int offset = i * n;
      int sum = 0;

      for (int j = 0; j < 2048; ++j)
        sum += a[offset + j] * b[j];

      output[i] = sum;
    }
}

On neoverse-n2, we get:

void foo(int *__restrict__ output, char *a, char *b, int n)
{
  // outer loop: not unrolled
  for (int i = 0; i < 2048; ++i)
    {
      int offset = i * n;
      vector(4) int v_sum0 = { 0 };
      vector(4) int v_sum1 = { 0 };

      // inner loop: suggested unroll factor = 2
      for (int j = 0; j < 2048; j += 16 * 2)
        {
          vector(16) char v_a0 = VECTOR_LOAD (&a[offset + j + 16 * 0]);
          vector(16) char v_a1 = VECTOR_LOAD (&a[offset + j + 16 * 1]);
          vector(16) char v_b0 = VECTOR_LOAD (&b[j + 16 * 0]);
          vector(16) char v_b1 = VECTOR_LOAD (&b[j + 16 * 1]);

          v_sum0 += DOT_PROD (v_a0, v_b0);  // v_b0 and v_b1 are different
          v_sum1 += DOT_PROD (v_a1, v_b1);
        }
      output[i] = REDUC_PLUS (v_sum0 + v_sum1);
    }
}

Since "b[j]" does not depend on the outer loop index i, the loads of "b[j]"
could be reused via loop tiling, which here amounts to a transform similar to
unrolling the outer loop. The resulting code would look like:

void foo(int *__restrict__ output, char *a, char *b, int n)
{
  // outer loop: unrolled by 2
  for (int i = 0; i < 2048; i += 2)
    {
      int offset0 = (i + 0) * n;
      int offset1 = (i + 1) * n;
      vector(4) int v_sum0 = { 0 };
      vector(4) int v_sum1 = { 0 };

      // inner loop: suggested unroll factor = 1
      for (int j = 0; j < 2048; j += 16)
        {
          vector(16) char v_a0 = VECTOR_LOAD (&a[offset0 + j]);
          vector(16) char v_a1 = VECTOR_LOAD (&a[offset1 + j]);
          vector(16) char v_b = VECTOR_LOAD (&b[j]);

          v_sum0 += DOT_PROD (v_a0, v_b);  // reuse b[j]
          v_sum1 += DOT_PROD (v_a1, v_b);  // reuse b[j]
        }
      output[i + 0] = REDUC_PLUS (v_sum0);
      output[i + 1] = REDUC_PLUS (v_sum1);
    }
}
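
To make the proposed transform easier to check, here is a minimal scalar
sketch of it in plain C. It is not GCC output and not a proposed
implementation. The function names dot_rows_ref and dot_rows_tiled and the
rows/cols parameters are hypothetical, introduced only so the loops can be run
on small inputs instead of the fixed bound 2048. The sketch also assumes an
even row count, so no epilogue for a leftover row is shown.

```c
#include <assert.h>

/* Reference: the loop nest from the report, with the fixed bound 2048
   replaced by the hypothetical parameters rows and cols.  */
void dot_rows_ref(int *restrict output, const char *a, const char *b,
                  int n, int rows, int cols)
{
  for (int i = 0; i < rows; ++i)
    {
      int offset = i * n;
      int sum = 0;

      for (int j = 0; j < cols; ++j)
        sum += a[offset + j] * b[j];

      output[i] = sum;
    }
}

/* Outer loop unrolled by 2 and fused into a single inner loop, so that
   each b[j] is loaded once and used for two rows of a.  */
void dot_rows_tiled(int *restrict output, const char *a, const char *b,
                    int n, int rows, int cols)
{
  /* Assumption of this sketch: no epilogue for an odd row count.  */
  assert(rows % 2 == 0);

  for (int i = 0; i < rows; i += 2)
    {
      int offset0 = (i + 0) * n;
      int offset1 = (i + 1) * n;
      int sum0 = 0;
      int sum1 = 0;

      for (int j = 0; j < cols; ++j)
        {
          int bj = b[j];  /* loaded once, reused for both rows */

          sum0 += a[offset0 + j] * bj;
          sum1 += a[offset1 + j] * bj;
        }

      output[i + 0] = sum0;
      output[i + 1] = sum1;
    }
}
```

Both functions should produce the same output for any even rows. In terms of
the vector pseudocode above, the inner loop body goes from 4 vector loads per
2 DOT_PRODs (inner loop unrolled by 2) to 3 vector loads per 2 DOT_PRODs
(outer loop unrolled by 2), while keeping two independent accumulators in
both cases.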