https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122592
Bug ID: 122592
Summary: aarch64 adds excessive masking for uint16_t values
Product: gcc
Version: 13.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jwerner at chromium dot org
Target Milestone: ---
When working with 16-bit halfwords on aarch64, GCC often adds unnecessary
`and` instructions that clear the top half of the register, even when those
bits cannot change the outcome of the following operations. A simple example
is this:
unsigned int x(int y, unsigned short a, unsigned short b)
{
    if (y)
        a = ((a & 0xff) << 8) | (a >> 8);
    return a + b;
}
GCC 13.2 compiles this with -Os to:
   0:   12003c21    and    w1, w1, #0xffff
   4:   12003c42    and    w2, w2, #0xffff
   8:   34000060    cbz    w0, 14 <x+0x14> (File Offset: 0x54)
   c:   5ac00421    rev16  w1, w1
  10:   12003c21    and    w1, w1, #0xffff
  14:   0b020020    add    w0, w1, w2
  18:   d65f03c0    ret
The masks of w1 at offsets 0 and 0x10 are redundant with each other: `rev16`
only swaps the bytes within each halfword, so a top half that has already been
cleared stays cleared, and it would be more efficient to mask w1 only once. One
could also use the UXTH extended-register form of `add` to fold the masking of
w2 into the final addition. An optimal implementation of the function could
look like this:
    and    w1, w1, #0xffff
    cbz    w0, <skip next instruction>
    rev16  w1, w1
    add    w0, w1, w2, uxth
    ret
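The reasoning behind the single mask can be checked with a small C model. This
is only a sketch: `rev16_model`, `proposed_x` and `check_proposed` are
hypothetical helper names that imitate the instructions in plain C, and the
register arguments deliberately carry garbage in their upper 16 bits, on the
assumption that the incoming high bits are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Model of AArch64 REV16 on a W register: swap the two bytes of each halfword. */
static uint32_t rev16_model(uint32_t w)
{
    return ((w & 0x00ff00ffu) << 8) | ((w & 0xff00ff00u) >> 8);
}

/* Reference semantics: the function from the report, unchanged. */
static unsigned int x(int y, unsigned short a, unsigned short b)
{
    if (y)
        a = ((a & 0xff) << 8) | (a >> 8);
    return a + b;
}

/* Hypothetical model of the proposed sequence. w1 and w2 are raw register
   contents whose upper 16 bits may hold arbitrary values. */
static uint32_t proposed_x(uint32_t w0, uint32_t w1, uint32_t w2)
{
    w1 &= 0xffffu;                 /* and   w1, w1, #0xffff  */
    if (w0 != 0)                   /* cbz   w0, <skip>       */
        w1 = rev16_model(w1);      /* rev16 w1, w1           */
    return w1 + (w2 & 0xffffu);    /* add   w0, w1, w2, uxth */
}

/* Compares the model with the reference over a sample of inputs and
   asserts on any mismatch. Returns the number of cases checked. */
unsigned check_proposed(void)
{
    unsigned cases = 0;
    for (uint32_t a = 0; a <= 0xffffu; a += 251u) {
        /* A cleared top half stays cleared across rev16. */
        assert((rev16_model(a) >> 16) == 0);
        for (uint32_t b = 0; b <= 0xffffu; b += 4099u)
            for (int y = 0; y <= 1; y++) {
                uint32_t got = proposed_x((uint32_t)y,
                                          0xdead0000u | a, 0xbeef0000u | b);
                assert(got == x(y, (unsigned short)a, (unsigned short)b));
                cases++;
            }
    }
    return cases;
}
```

Calling `check_proposed()` from a `main` exercises both the `y == 0` and
`y != 0` paths. The check is sampled, not exhaustive.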
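Another sequence of the same length would be the following, although masking
after the addition is only equivalent when the sum is truncated to 16 bits,
because a + b can carry into bit 16: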
    cbz    w0, <skip next instruction>
    rev16  w1, w1
    add    w0, w1, w2
    and    w0, w0, #0xffff
    ret
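How a mask placed after the addition interacts with the 17-bit range of a + b
can be checked with a short C sketch. The helper names `sum_ref`,
`sum_masked_after` and `check_late_mask` are hypothetical and exist only for
this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: a and b are promoted to int, so the sum can need 17 bits. */
static unsigned int sum_ref(unsigned short a, unsigned short b)
{
    return a + b;
}

/* Model of "add w0, w1, w2" followed by "and w0, w0, #0xffff". */
static uint32_t sum_masked_after(uint32_t w1, uint32_t w2)
{
    return (w1 + w2) & 0xffffu;
}

/* Shows one input where the late mask agrees with the reference and one
   where the carry into bit 16 is lost. */
void check_late_mask(void)
{
    /* No carry out of bit 15: both give 0x1235. */
    assert(sum_ref(0x1234, 0x0001) == 0x1235u);
    assert(sum_masked_after(0x1234u, 0x0001u) == 0x1235u);

    /* Carry into bit 16: the reference keeps it, the late mask drops it. */
    assert(sum_ref(0xffff, 0x0001) == 0x10000u);
    assert(sum_masked_after(0xffffu, 0x0001u) == 0u);
}
```

The two results differ whenever a + b exceeds 0xffff, so the late mask only
matches the `unsigned int` version of the function for inputs without a carry.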
For comparison, clang 18.0 produces the following, which also doesn't seem
optimal but at least doesn't emit any redundant masking instructions the way
GCC does:
   0:   5ac00828    rev    w8, w1
   4:   7100001f    cmp    w0, #0x0
   8:   53107d08    lsr    w8, w8, #16
   c:   1a880028    csel   w8, w1, w8, eq  // eq = none
  10:   12003d08    and    w8, w8, #0xffff
  14:   0b222100    add    w0, w8, w2, uxth
  18:   d65f03c0    ret
Another issue is that GCC seems to insist on masking even if the ABI allows the
high bits to be arbitrary. When changing the return type in the same function
to `unsigned short`, GCC still produces the same code. By comparison, clang
recognizes that it doesn't need to mask anything in that case and produces:
   0:   5ac00828    rev    w8, w1
   4:   7100001f    cmp    w0, #0x0
   8:   53107d08    lsr    w8, w8, #16
   c:   1a880028    csel   w8, w1, w8, eq  // eq = none
  10:   0b020100    add    w0, w8, w2
  14:   d65f03c0    ret
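Why no mask is needed in the `unsigned short` case can be sketched in C: the
low 16 bits of a sum depend only on the low 16 bits of its operands. The model
below assumes, as stated above, that the ABI leaves the bits above the 16-bit
return value unspecified, so a caller only looks at the low halfword. The names
`x16`, `rev_model`, `clang_x16_raw` and `check_clang_x16` are hypothetical
helpers, not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Reference semantics with the return type changed to unsigned short. */
static unsigned short x16(int y, unsigned short a, unsigned short b)
{
    if (y)
        a = ((a & 0xff) << 8) | (a >> 8);
    return a + b;
}

/* Model of AArch64 REV on a W register: reverse all four bytes. */
static uint32_t rev_model(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
           ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* Hypothetical model of the mask-free sequence; returns the raw w0 value,
   whose upper 16 bits may be garbage. */
static uint32_t clang_x16_raw(uint32_t w0, uint32_t w1, uint32_t w2)
{
    uint32_t w8 = rev_model(w1);   /* rev  w8, w1           */
    w8 >>= 16;                     /* lsr  w8, w8, #16      */
    w8 = (w0 == 0) ? w1 : w8;      /* cmp w0, #0x0 + csel   */
    return w8 + w2;                /* add  w0, w8, w2       */
}

/* Checks that the low halfword of the raw result matches the reference
   over a sample of inputs. Returns the number of cases checked. */
unsigned check_clang_x16(void)
{
    unsigned cases = 0;
    for (uint32_t a = 0; a <= 0xffffu; a += 251u)
        for (uint32_t b = 0; b <= 0xffffu; b += 4099u)
            for (int y = 0; y <= 1; y++) {
                uint32_t raw = clang_x16_raw((uint32_t)y,
                                             0xdead0000u | a, 0xbeef0000u | b);
                /* The caller only consumes the low 16 bits. */
                assert((uint16_t)raw ==
                       x16(y, (unsigned short)a, (unsigned short)b));
                cases++;
            }
    return cases;
}
```

The upper halves of the inputs are filled with arbitrary values on purpose, so
any dependence of the low halfword on them would trip an assertion.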