https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82533

            Bug ID: 82533
           Summary: inefficient code generation for copy loop on falkor
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilson at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Created attachment 42348
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42348&action=edit
testcase, to reproduce compile with -mcpu=falkor -O2 -ftree-vectorize

When the lmbench stream copy loop is compiled with -O2 -ftree-vectorize
-mcpu=falkor, the inner loop is compiled to:
.L4:
        ldr     q0, [x2, x3]
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L4
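
For readers without the attachment, a loop of roughly the following shape is
the kind of source that vectorizes into the code above. This is only a sketch
and not the attached testcase: the function name copy_kernel, the double
element type, and the restrict qualifiers are assumptions made for
illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the attached testcase: a plain element-wise
   copy.  With 8-byte doubles, one 128-bit q register holds two elements,
   which would match the 16-byte index step seen in the loop above.  */
void copy_kernel(double *restrict dst, const double *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Compiling such a file with -O2 -ftree-vectorize -mcpu=falkor -S should allow
the inner loop to be inspected, though the exact output depends on the
compiler revision.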

The str qX [reg+reg] instruction is very inefficient on Falkor.  We see a 16%
performance increase if we disable use of str qX [reg+reg], which gives instead:
.L4:
        ldr     q0, [x2, x3]
        add     x5, x1, x3
        add     x3, x3, 16
        cmp     x3, x4
        str     q0, [x5]
        bne     .L4

A proposed patch was posted to gcc-patches here:
    https://gcc.gnu.org/ml/gcc-patches/2017-09/msg01547.html
