https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122592

            Bug ID: 122592
           Summary: aarch64 adds excessive masking for uint16_t values
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jwerner at chromium dot org
  Target Milestone: ---

When working with 16-bit halfwords on aarch64, GCC often seems to add
unnecessary masking instructions that clear the top half of the register,
even when those bits wouldn't change the outcome of subsequent operations. A
simple example is this:

unsigned int x(int y, unsigned short a, unsigned short b)
{
  if (y)
    a = ((a & 0xff) << 8) | (a >> 8);  /* byte-swap the 16-bit value */
  return a + b;
}
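
For reference, a plausible way to reproduce the disassembly below (the
exact target triple is an assumption):

  aarch64-linux-gnu-gcc -Os -c test.c
  aarch64-linux-gnu-objdump -d test.o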

GCC 13.2 compiles this with -Os to:

   0:   12003c21        and     w1, w1, #0xffff
   4:   12003c42        and     w2, w2, #0xffff
   8:   34000060        cbz     w0, 14 <x+0x14>
   c:   5ac00421        rev16   w1, w1
  10:   12003c21        and     w1, w1, #0xffff
  14:   0b020020        add     w0, w1, w2
  18:   d65f03c0        ret

Rather than masking `w1` twice (at offset 0 and again at offset 0x10, after
the `rev16`), masking it only once would suffice. One could also use the
UXTH extended-register form of `add` to fold the masking of `w2` into the
final addition. An optimal implementation of the function could look like
this:

and w1, w1, #0xffff
cbz w0, <skip next instruction>
rev16 w1, w1
add w0, w1, w2, uxth
ret

Another solution of the same length, relying on the fact that the low 16
bits of the sum depend only on the low 16 bits of the operands, would be:

cbz w0, <skip next instruction>
rev16 w1, w1
add w0, w1, w2
and w0, w0, #0xffff
ret

For comparison, clang 18.0 generates the code below, which also doesn't
seem optimal but at least doesn't emit any "useless" masking instructions
the way GCC does:

   0:   5ac00828        rev     w8, w1
   4:   7100001f        cmp     w0, #0x0
   8:   53107d08        lsr     w8, w8, #16
   c:   1a880028        csel    w8, w1, w8, eq  // eq = none
  10:   12003d08        and     w8, w8, #0xffff
  14:   0b222100        add     w0, w8, w2, uxth
  18:   d65f03c0        ret

Another issue is that GCC seems to insist on masking even when the ABI
allows the high bits to be arbitrary. Changing the return type of the same
function to `unsigned short` still yields exactly the same code from GCC.
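
For reference, the modified function (identical apart from the return
type):

unsigned short x(int y, unsigned short a, unsigned short b)
{
  if (y)
    a = ((a & 0xff) << 8) | (a >> 8);  /* byte-swap the 16-bit value */
  return a + b;
}

By comparison, clang recognizes that it doesn't need to mask anything in
that case and produces: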

   0:   5ac00828        rev     w8, w1
   4:   7100001f        cmp     w0, #0x0
   8:   53107d08        lsr     w8, w8, #16
   c:   1a880028        csel    w8, w1, w8, eq  // eq = none
  10:   0b020100        add     w0, w8, w2
  14:   d65f03c0        ret
