https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122219

            Bug ID: 122219
           Summary: Missed store sinking when using memcpy with
                    vector_size type in inlined functions
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pfustc at gcc dot gnu.org
  Target Milestone: ---

When using an inlined function to cast data from float32x4_t (16 bytes) to a
32-byte vector_size type, GCC fails to sink the stores out of the loop. The
same logic expressed directly with memcpy is optimized correctly.

Reproducer:

#include <arm_neon.h>

typedef float v256_t __attribute__((vector_size(32)));

inline v256_t cast(float32x4_t fv) {
  v256_t r;
  __builtin_memcpy(&r, &fv, 16);
  return r;
}

float foo(unsigned n) {
  v256_t vec;

  float32x4_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (int i = 0; i < n; i++) {
    fv = vmulq_f32(fv, fv);
    vec = cast(fv);
  }
  return vec[n % 8];
}

float bar(unsigned n) {
  v256_t vec;

  float32x4_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (int i = 0; i < n; i++) {
    fv = vmulq_f32(fv, fv);
    __builtin_memcpy(&vec, &fv, 16);
  }
  return vec[n % 8];
}

Compiling with "-O3" on AArch64 produces the following loop bodies for the two
functions.

foo() - BAD (store sinking failed)

.L3:
        add     w1, w1, 1
        fmul    v31.4s, v31.4s, v31.4s
        str     q31, [sp]
        ldp     q29, q30, [x2]
        stp     q29, q30, [sp, 32]
        cmp     w1, w0
        bne     .L3

bar() - GOOD (store sinking succeeded)

.L12:
        add     w1, w1, 1
        fmul    v31.4s, v31.4s, v31.4s
        cmp     w1, w0
        bne     .L12
        str     q31, [sp]

Initial analysis:

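These are some of the GIMPLE statements in "foo()" just before the LIM pass
("ret" appears to be the local variable of the inlined cast(), which is named
"r" in the reproducer above):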

  MEM <uint128_t> [(char * {ref-all})&ret] = _10;
  ret.1_14 = ret;
  ret ={v} {CLOBBER(eos)};
  vec = ret.1_14;

Store sinking in the LIM pass finds a dependence between the first statement,
"MEM [&ret] = ...", and the load that follows it, "ret.1_14 = ret". As a
result, it does not sink the store out of the loop.
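
For reference, here is a rough source-level sketch of the two shapes, written
with generic GCC vector extensions instead of NEON so that it builds on any
target. It is a hand-inlined approximation of the GIMPLE above, not compiler
output, and I have not checked whether it reproduces the missed optimization
on its own. The names v128_t, foo_inlined_shape and bar_direct_shape are
hypothetical. Unlike the reproducer, "vec" is zero-initialized and only lanes
0..3 are read, so that the result is well defined.

```c
/* Stand-ins for the types in the reproducer (assumption: the NEON type
   behaves like a plain 16-byte GCC vector for this purpose). */
typedef float v128_t __attribute__((vector_size(16)));
typedef float v256_t __attribute__((vector_size(32)));

/* Approximate shape of foo() after cast() has been inlined: a 16-byte
   store into a temporary, then a full 32-byte copy of that temporary. */
float foo_inlined_shape(unsigned n) {
  v256_t vec = {0};  /* zeroed only to keep this sketch deterministic */
  v128_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (unsigned i = 0; i < n; i++) {
    v256_t ret;                       /* dead after each iteration (CLOBBER) */
    fv = fv * fv;                     /* stand-in for vmulq_f32(fv, fv) */
    __builtin_memcpy(&ret, &fv, 16);  /* MEM <uint128_t> [&ret] = _10 */
    vec = ret;                        /* ret.1_14 = ret; vec = ret.1_14 */
  }
  return vec[n % 4];  /* only lanes 0..3 are ever written */
}

/* Shape of bar(): the 16-byte store goes straight into vec. */
float bar_direct_shape(unsigned n) {
  v256_t vec = {0};
  v128_t fv = {1.0f, 2.0f, 3.0f, 4.0f};
  for (unsigned i = 0; i < n; i++) {
    fv = fv * fv;
    __builtin_memcpy(&vec, &fv, 16);
  }
  return vec[n % 4];
}
```

Both functions compute the same value. The only difference is the full-width
load of "ret" that sits between the partial store and the store to "vec".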

How should we fix this?
