https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79794
            Bug ID: 79794
           Summary: unnecessary copy from target to target results in poor
                    code for aarch64
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilson at gcc dot gnu.org
  Target Milestone: ---

I noticed this while looking at Maxim's testcase in bug 18438, comment 9,
which is extracted from SPEC CPU2006 462.libquantum.  The testcase is

struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func(int target, struct reg_struct *reg)
{
  int i;

  for (i = 0; i < reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}

The code generated is

.L4:
        ld2     {v0.2d - v1.2d}, [x1]
        add     w2, w2, 1
        cmp     w2, w7
        eor     v0.16b, v2.16b, v0.16b
        umov    x4, v0.d[1]
        st1     {v0.d}[0], [x1]
        add     x1, x1, 32
        str     x4, [x1, -16]
        bcc     .L4

One half of the result is stored from vector regs (st1), while the other half
is moved to general regs and then stored (umov/str), which is curious.

The problem occurs when generating RTL.  The final tree dump has

  _51 = BIT_FIELD_REF <vect__7.13_44, 64, 64>;
  MEM[base: vectp.8_39, offset: 16B] = _51;

which is OK, but the RTL generated for this is

(insn 69 68 70 6 (set (reg:DI 144)
        (plus:DI (reg/f:DI 130 [ vectp.8 ])
            (const_int 16 [0x10]))) "tmp.c":19 -1
     (nil))
(insn 70 69 71 6 (set (mem:DI (reg:DI 144) [4 MEM[base: vectp.8_39, offset: 16B]+0 S8 A64])
        (vec_select:DI (reg:V2DI 134 [ vect__7.13 ])
            (parallel [
                    (const_int 1 [0x1])
                ]))) "tmp.c":19 -1
     (nil))
(insn 71 70 72 6 (set (reg:DI 145)
        (mem:DI (reg:DI 144) [4 MEM[base: vectp.8_39, offset: 16B]+0 S8 A64])) "tmp.c":19 -1
     (nil))
(insn 72 71 73 6 (set (mem:DI (plus:DI (reg/f:DI 130 [ vectp.8 ])
                (const_int 16 [0x10])) [4 MEM[base: vectp.8_39, offset: 16B]+0 S8 A64])
        (reg:DI 145)) "tmp.c":19 -1
     (nil))

What happened here is that we called extract_bit_field with a vector register
source and a MEM target using a reg+offset addressing mode.
AArch64 does have an instruction to store part of a vector register, but it
only accepts a register addressing mode, so we have to fix up the address
before generating the store.  We then return to expand_expr, which does the
test

  ! rtx_equal_p (temp, target)

sees that the RTL is different because the addressing mode changed, and
proceeds to copy from the target into a temp reg and back out to the target
again.  Later, the DSE pass removes the first (good) store, leaving the
second (bad) store, which forces the register allocator to emit an
instruction to copy the value from a vector reg to a general reg.

With a tentative patch to fix this, I instead get this code

.L4:
        ld2     {v0.2d - v1.2d}, [x1]
        add     x2, x1, 16
        add     w3, w3, 1
        cmp     w3, w5
        eor     v0.16b, v2.16b, v0.16b
        st1     {v0.d}[0], [x1]
        add     x1, x1, 32
        st1     {v0.d}[1], [x2]
        bcc     .L4
        and     w4, w4, -2

which looks better, as both halves of the result are stored with vector
stores (st1).  It is the same number of instructions, with the umov replaced
by an address add.  Preliminary testing shows that this is faster on some
aarch64 machines (e.g. xgene1) and the same speed on others (e.g. kryo); it
depends on the speed of the umov instruction, which is slower than an add on
some aarch64 parts.