https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82533
            Bug ID: 82533
           Summary: inefficient code generation for copy loop on falkor
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilson at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Created attachment 42348
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42348&action=edit
testcase, to reproduce compile with -mcpu=falkor -O2 -ftree-vectorize

When lmbench stream copy is compiled with -O2 -ftree-vectorize -mcpu=falkor,
the inner loop gets compiled to

.L4:
        ldr     q0, [x2, x3]
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L4

The str qX [reg+reg] instruction is very inefficient on Falkor.  We get a 16%
performance increase if we disable use of str qX [r+r], to get instead

.L4:
        ldr     q0, [x2, x3]
        add     x5, x1, x3
        add     x3, x3, 16
        cmp     x3, x4
        str     q0, [x5]
        bne     .L4

A proposed patch was posted to gcc-patches here
https://gcc.gnu.org/ml/gcc-patches/2017-09/msg01547.html
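
For readers without access to the attachment, the following is a minimal
sketch of the kind of loop being discussed. It is not the attached testcase
and not the lmbench source; the function name stream_copy and its signature
are assumptions made for illustration. A loop of this shape, built with
-O2 -ftree-vectorize -mcpu=falkor, is the sort of code that may vectorize
into a 128-bit ldr/str pair indexed by a shared register offset, though the
exact output depends on the compiler version and flags (some configurations
may turn it into a memcpy call instead).

```c
#include <stddef.h>

/* Hypothetical reduction of a STREAM-style copy kernel.  The name
   stream_copy and its parameters are illustrative only; they are not
   taken from lmbench or from attachment 42348.  */
void stream_copy(double *restrict dst, const double *restrict src, size_t n)
{
    /* Element-wise copy.  With vectorization enabled, two doubles fit
       in one 128-bit q register, so the compiler may copy 16 bytes per
       iteration of the vectorized loop.  */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Comparing the output of gcc -S on such a file with and without the proposed
patch should show whether the store uses the [reg+reg] form or a separately
computed base address.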