https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92665

            Bug ID: 92665
           Summary: [AArch64] low lanes select not optimized out for vmlal
                    intrinsics
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org
  Target Milestone: ---

With gcc as of today, the low-half selects (vget_low_s16) are not folded into
smlal, leaving two dup instructions that could be optimized out:

$ cat red.c
#include "arm_neon.h"

int32x4_t fun(int32x4_t a, int16x8_t b, int16x8_t c) {
  a = vmlal_s16(a, vget_low_s16(b), vget_low_s16(c));
  a = vmlal_high_s16(a, b, c);
  return a;
}

$ gcc -O3 -S -o- red.c
fun:
        dup     d3, v1.d[0]
        dup     d4, v2.d[0]
        smlal v0.4s,v3.4h,v4.4h
        smlal2 v0.4s,v1.8h,v2.8h
        ret

$ clang -O3 -S -o- red.c
fun:
        smlal   v0.4s, v1.4h, v2.4h
        smlal2  v0.4s, v1.8h, v2.8h
        ret
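
For reference, here is a scalar sketch of what fun computes, to make it clearer why the two dup instructions are redundant. SMLAL reads only the low 64 bits of its source registers and SMLAL2 reads the high 64 bits, so copying the low halves of v1 and v2 into d3 and d4 first does not change the result. The names mlal4 and fun_model are hypothetical helpers for illustration and do not appear in the testcase or in GCC; the model uses plain arrays rather than NEON types, so it builds on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one widening multiply-accumulate over 4 lanes:
   acc[i] += (int32_t)x[i] * (int32_t)y[i].
   The add is done in uint32_t so that it wraps instead of invoking
   signed-overflow undefined behaviour. */
void mlal4(int32_t acc[4], const int16_t x[4], const int16_t y[4])
{
  for (int i = 0; i < 4; i++) {
    int32_t prod = (int32_t)x[i] * (int32_t)y[i];
    acc[i] = (int32_t)((uint32_t)acc[i] + (uint32_t)prod);
  }
}

/* Hypothetical scalar model of fun() from red.c. */
void fun_model(int32_t a[4], const int16_t b[8], const int16_t c[8])
{
  mlal4(a, b, c);         /* vmlal_s16 on the low halves: lanes 0..3 */
  mlal4(a, b + 4, c + 4); /* vmlal_high_s16: lanes 4..7 */
}

/* Small self-check on hand-computed values; returns 1 on success. */
int fun_model_check(void)
{
  int32_t a[4] = {1, 2, 3, 4};
  const int16_t b[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  const int16_t c[8] = {1, 1, 1, 1, 2, 2, 2, 2};

  fun_model(a, b, c);
  assert(a[0] == 12); /* 1 + 1*1 + 5*2 */
  assert(a[3] == 24); /* 4 + 4*1 + 8*2 */
  return 1;
}
```

In this model the first mlal4 call indexes b[0..3] and c[0..3] directly, with no copy of the low half, which is what the clang output above does by passing v1.4h and v2.4h straight to smlal.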
