https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79794

            Bug ID: 79794
           Summary: unnecessary copy from target to target results in poor
                    code for aarch64
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilson at gcc dot gnu.org
  Target Milestone: ---

I noticed this while looking at Maxim's testcase in comment 9 of bug 18438,
which was extracted from SPEC CPU2006 462.libquantum.  The testcase is:

struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}
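
For reference, a minimal driver that exercises func from a separate file could
look like this (my addition, not part of the original testcase; the structs are
copied from above, and the 1024-node array and bit 3 are arbitrary choices):

#include <stdlib.h>

struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

extern void func(int target, struct reg_struct *reg);

int main(void)
{
  struct reg_struct r;

  /* Arbitrary size and bit position, just to exercise the loop.  */
  r.size = 1024;
  r.node = calloc((size_t) r.size, sizeof *r.node);
  if (!r.node)
    return 1;
  func(3, &r);
  free(r.node);
  return 0;
}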

The code generated is 
.L4:
        ld2     {v0.2d - v1.2d}, [x1]
        add     w2, w2, 1
        cmp     w2, w7
        eor     v0.16b, v2.16b, v0.16b
        umov    x4, v0.d[1]
        st1     {v0.d}[0], [x1]
        add     x1, x1, 32
        str     x4, [x1, -16]
        bcc     .L4
One half of the result is stored from vector regs (st1), and one half gets
moved to general regs and then stored (umov/str), which is curious.

The problem occurs when generating RTL.  The final tree dump has
  _51 = BIT_FIELD_REF <vect__7.13_44, 64, 64>;
  MEM[base: vectp.8_39, offset: 16B] = _51;
which is OK, but the RTL generated for this is
(insn 69 68 70 6 (set (reg:DI 144)
        (plus:DI (reg/f:DI 130 [ vectp.8 ])
            (const_int 16 [0x10]))) "tmp.c":19 -1
     (nil))
(insn 70 69 71 6 (set (mem:DI (reg:DI 144) [4 MEM[base: vectp.8_39, offset: 16B]+0 S8 A64])
        (vec_select:DI (reg:V2DI 134 [ vect__7.13 ])
            (parallel [
                    (const_int 1 [0x1])
                ]))) "tmp.c":19 -1
     (nil))
(insn 71 70 72 6 (set (reg:DI 145)
        (mem:DI (reg:DI 144) [4 MEM[base: vectp.8_39, offset: 16B]+0 S8 A64])) "tmp.c":19 -1
     (nil))
(insn 72 71 73 6 (set (mem:DI (plus:DI (reg/f:DI 130 [ vectp.8 ])
                (const_int 16 [0x10])) [4 MEM[base: vectp.8_39, offset: 16B]+0 S8 A64])
        (reg:DI 145)) "tmp.c":19 -1
     (nil))
What happened here is that we called extract_bit_field with a vector register
source and a MEM target that uses a reg+offset addressing mode.  AArch64 does
have an instruction to store part of a vector register (st1), but it only
accepts a register addressing mode, so we have to fix up the address and then
generate the store.  We then return to expand_expr.  Its test
     ! rtx_equal_p (temp, target)
sees that the RTL is different because the addressing mode changed, so it
proceeds to copy from the target into a temp reg and back out to the target
again.  Later, the DSE pass removes the first (good) store and leaves the
second (bad) store, forcing the register allocator to emit an instruction to
copy the value from a vector reg to a general reg.
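
To make the failure mode concrete, here is a small, self-contained C sketch
(entirely my own illustration, not GCC code; the addr type and
addr_struct_equal are invented).  It mimics how a purely structural test in
the spirit of rtx_equal_p cannot see that (mem (reg 144)) and
(mem (plus (reg 130) (const_int 16))) name the same location once reg 144 has
been loaded with reg 130 + 16, which is exactly the mismatch that triggers the
extra copy:

#include <stdbool.h>
#include <stdio.h>

/* Tiny, invented stand-in for an RTL address expression; it only serves to
   illustrate why a structural comparison misses the match.  */
enum addr_kind { ADDR_REG, ADDR_REG_PLUS_CONST };

struct addr
{
  enum addr_kind kind;
  int regno;      /* base register number */
  long offset;    /* only meaningful for ADDR_REG_PLUS_CONST */
};

/* Structural equality: two addresses compare equal only if they are
   written the same way.  */
static bool
addr_struct_equal (const struct addr *a, const struct addr *b)
{
  if (a->kind != b->kind || a->regno != b->regno)
    return false;
  return a->kind == ADDR_REG || a->offset == b->offset;
}

int
main (void)
{
  /* target: (mem (plus (reg 130) (const_int 16)))  */
  struct addr target = { ADDR_REG_PLUS_CONST, 130, 16 };
  /* temp:   (mem (reg 144)), after insn 69 set reg 144 = reg 130 + 16  */
  struct addr temp = { ADDR_REG, 144, 0 };

  /* Both forms name the same byte at run time, but a structural test
     cannot know that, so the expander concludes it still has to move the
     value from temp back into target -- the redundant copy in insns 71/72.  */
  printf ("structurally equal: %s\n",
          addr_struct_equal (&target, &temp) ? "yes" : "no");
  return 0;
}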

With a tentative patch to fix this, I instead get this code
.L4:
        ld2     {v0.2d - v1.2d}, [x1]
        add     x2, x1, 16
        add     w3, w3, 1
        cmp     w3, w5
        eor     v0.16b, v2.16b, v0.16b
        st1     {v0.d}[0], [x1]
        add     x1, x1, 32
        st1     {v0.d}[1], [x2]
        bcc     .L4
        and     w4, w4, -2
which looks better, as both halves of the result are now stored with vector
stores (st1).  It is the same number of instructions, with the umov/str pair
replaced by an address add and a second st1.

Preliminary testing shows that this is faster on some aarch64 machines (e.g.
xgene1), and the same speed on others (e.g. kryo).  It depends on the speed of
the umov instruction, which is slower than an add on some aarch64 parts.
