Hello all,

I'm not sure whether this has been posted before, but gcc generates
slightly inefficient code for 64-bit (long long) integers on narrower
targets in several cases:

unsigned long long val;

void example1() {
    val += 0x800000000000ULL;
}

On 32-bit x86 this results in the following assembly:
addl $0, val
adcl $32768, val+4
ret

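The first add is unnecessary: the low 32 bits of the constant are zero,
so adding them neither modifies val's low word nor sets the carry, and
the adcl could simply be an addl.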
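To make the point concrete, here is a small C sketch that splits val into
32-bit halves the way a 32-bit target must. The helper names add_split and
check_add_split are hypothetical, chosen only for illustration. Because the
low half of 0x800000000000 is zero, the low word never changes and no carry
can come out of it, so a single add on the high word is enough.

```c
#include <assert.h>
#include <stdint.h>

/* The constant from example1: 0x800000000000 == 0x00008000_00000000.
   Its low 32 bits are zero; its high 32 bits are 0x8000 (32768). */
#define ADD_CONST 0x800000000000ULL

/* Hypothetical helper: computes v + ADD_CONST using only 32-bit halves,
   mimicking what a 32-bit target has to do. */
static uint64_t add_split(uint64_t v)
{
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);

    /* Adding the zero low half would leave lo unchanged and can never
       carry, so that step (the "addl $0") is simply omitted. */
    hi += (uint32_t)(ADD_CONST >> 32);   /* plain add, no add-with-carry */

    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical self-check: the split form must agree with ordinary
   64-bit addition (both wrap modulo 2^64). */
static void check_add_split(uint64_t v)
{
    assert(add_split(v) == v + ADD_CONST);
}
```

For example, add_split(0x00000001FFFFFFFFULL) leaves the all-ones low word
untouched and returns 0x00008001FFFFFFFFULL.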
This isn't too bad, but compiling for something like AVR results in
eight single-byte loads, followed by three additions (of the high bytes),
followed by eight single-byte stores.
The compiler doesn't recognize that 5 of those loads and 5 of those
stores are unnecessary, since the low 5 bytes of the constant are zero.
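The same argument at byte granularity, as on an 8-bit target like AVR, can
be sketched in C. This models val as an array of eight bytes, least
significant first; add_bytes and check_add_bytes are hypothetical helpers.
Bytes 0 to 4 of the constant are zero and there is no incoming carry, so
they are never read or written; only bytes 5, 6 and 7 take part, giving the
three additions (one add followed by two add-with-carry steps).

```c
#include <assert.h>
#include <stdint.h>

/* Least-significant-first byte view of 0x800000000000:
   bytes 0..4 are 0x00, byte 5 is 0x80, bytes 6..7 are 0x00. */
static const uint8_t ADD_CONST_BYTES[8] = {0, 0, 0, 0, 0, 0x80, 0, 0};

/* Hypothetical helper: adds the constant to an 8-byte value in place,
   one byte at a time with a carry, as an 8-bit CPU would. It starts at
   byte 5 because adding zero with no carry leaves a byte unchanged. */
static void add_bytes(uint8_t v[8])
{
    unsigned carry = 0;
    for (int i = 5; i < 8; i++) {             /* only bytes 5, 6, 7 */
        unsigned sum = v[i] + ADD_CONST_BYTES[i] + carry;
        v[i] = (uint8_t)sum;                  /* store the low 8 bits */
        carry = sum >> 8;                     /* carry into next byte */
    }
}

/* Hypothetical self-check against ordinary 64-bit addition. */
static void check_add_bytes(uint64_t x)
{
    uint8_t bytes[8];
    for (int i = 0; i < 8; i++)
        bytes[i] = (uint8_t)(x >> (8 * i));   /* split, LSB first */
    add_bytes(bytes);

    uint64_t y = 0;
    for (int i = 0; i < 8; i++)
        y |= (uint64_t)bytes[i] << (8 * i);   /* reassemble */
    assert(y == x + 0x800000000000ULL);
}
```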
Replacing the addition with bitwise or/xor also produces an
unnecessary instruction on x86, but produces optimal code on AVR.
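Or and xor differ from addition in that they never propagate a carry, which
is why they can collapse to a single byte operation on AVR. A minimal C
sketch, again treating val as eight bytes least significant first, with
hypothetical helper names: since the constant's only non-zero byte is byte 5
(0x80), only that byte changes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: OR / XOR 0x800000000000 into an 8-byte value.
   Bitwise operations have no carry, so touching byte 5 is enough. */
static void or_bytes(uint8_t v[8])  { v[5] |= 0x80; }
static void xor_bytes(uint8_t v[8]) { v[5] ^= 0x80; }

/* Hypothetical self-check against the full 64-bit operations. */
static void check_bitwise(uint64_t x)
{
    uint8_t a[8], b[8];
    for (int i = 0; i < 8; i++)
        a[i] = b[i] = (uint8_t)(x >> (8 * i));
    or_bytes(a);
    xor_bytes(b);

    uint64_t ya = 0, yb = 0;
    for (int i = 0; i < 8; i++) {
        ya |= (uint64_t)a[i] << (8 * i);
        yb |= (uint64_t)b[i] << (8 * i);
    }
    assert(ya == (x | 0x800000000000ULL));
    assert(yb == (x ^ 0x800000000000ULL));
}
```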


Here is another inefficiency for x86:

unsigned long long val = 0;
unsigned long small = 0;

unsigned long long example1() {
    return val | small;
}

unsigned long long example2() {
    return val & small;
}

For example1 this produces (bad):
movl small, %ecx
movl val, %eax
movl val+4, %edx
pushl %ebx
xorl %ebx, %ebx
orl %ecx, %eax
orl %ebx, %edx
popl %ebx
ret

For example2 (good):
movl small, %eax
xorl %edx, %edx
andl val, %eax
ret


The RTL generated for example1 and example2 is very similar until
the fwprop1 pass.
Since the word size on 32-bit x86 is 4 bytes, each 64-bit operation is
split into two 32-bit operations, and the high word of the zero-extended
small is the constant 0.
The forward propagator correctly realizes that ANDing the upper 4
bytes with 0 yields 0.
However, it doesn't seem to recognize that ORing the upper 4 bytes
with 0 should simply yield val's high word.
The same problem occurs with xor, and also when subtracting
(val - small).
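To illustrate what the split looks like, here is a C model of the two
32-bit halves. It assumes unsigned long is 32 bits, as on 32-bit x86, so
small is passed as uint32_t; pair32, split, join and the *_split functions
are hypothetical names. With the high half of small fixed at 0, and folds
to 0, or and xor fold to val's high word (x | 0 == x ^ 0 == x), and
subtraction only needs to propagate the borrow, so none of them needs a
register holding zero like %ebx in the example1 output.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit value as two 32-bit words. */
typedef struct { uint32_t lo, hi; } pair32;

static pair32 split(uint64_t x)
{
    pair32 p = { (uint32_t)x, (uint32_t)(x >> 32) };
    return p;
}

static uint64_t join(pair32 p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* val | small: the high half is hi | 0, i.e. unchanged. */
static uint64_t or_split(uint64_t val, uint32_t small)
{
    pair32 v = split(val);
    v.lo |= small;
    return join(v);
}

/* val & small: the high half is hi & 0 == 0 (the fold fwprop1 finds). */
static uint64_t and_split(uint64_t val, uint32_t small)
{
    pair32 v = split(val);
    v.lo &= small;
    v.hi = 0;
    return join(v);
}

/* val ^ small: the high half is hi ^ 0, i.e. unchanged. */
static uint64_t xor_split(uint64_t val, uint32_t small)
{
    pair32 v = split(val);
    v.lo ^= small;
    return join(v);
}

/* val - small: the high half is hi - 0 - borrow == hi - borrow. */
static uint64_t sub_split(uint64_t val, uint32_t small)
{
    pair32 v = split(val);
    uint32_t borrow = v.lo < small;   /* borrow out of the low word */
    v.lo -= small;
    v.hi -= borrow;
    return join(v);
}
```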

All examples were compiled with "-O2 -Wall"; I also tried -O3
and -Os with the same result.

Thanks for any help.
