Hi,
On 2026-02-25 13:05:14 -0500, Andres Freund wrote:
> At least gcc is doing some truly weird shit in the
> firstNonGuaranteed/firstNonCachedOffsetAttr loop "header" (i.e. just before
> the first entrance to the loop), which leads to the register pressure being
> high, which leads to spilling on the stack, making the few-tuples case slower:
>
> [ lots of stuff trimmed ]
>
> I.e. the compiler creates an offset version of tts_values[tts_nvalid],
> tts_isnull[tts_nvalid], which then creates register allocation pressure,
> because later the original tts_values/tts_isnull etc are accessed again and
> thus the underlying registers are preserved. And this is all for zero gain,
> from what I can tell, because the accesses are still done with indexed
> addressing (like mov %rdi,(%r12,%rcx,8)), which would work just as
> well if rcx were indexed based on attnum, not zero indexed within the loop.
>
> I see about a 10% improvement if I dissuade the compiler from doing that by
> adding
> __asm__ volatile ("" : "+r"(attnum) : :);
>
> In the loop body.
>
>
> I'm getting to the point where I'd like to just hand write the assembler for
> this stupid function. Gah.

Huh. It seems to be related, at least partially, to using a signed int for
attnum et al. Because we compile with -fwrapv, the compiler can't assume that
attnum++ won't overflow, and a possible overflow makes the loop trip count a
lot more complicated to compute. Even so, I don't understand how it ends up
generating such crappy code, but since using size_t fixes it...
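
To illustrate the difference in index type, here is a minimal sketch. It is
not the actual deforming code: DemoSlot, fill_int() and fill_size_t() are
made-up names, and the struct only mimics the tts_values/tts_isnull arrays.
Compiling it with something like gcc -O2 -fwrapv -DNDEBUG -S should allow
comparing the two loop headers, but I have not verified that this reduced
form reproduces the bad register allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the slot arrays, not the real TupleTableSlot. */
typedef struct DemoSlot
{
	uintptr_t  *values;
	bool	   *isnull;
} DemoSlot;

/*
 * Signed int index. With -fwrapv, attnum++ is defined to wrap around, so
 * the compiler cannot rule out overflow when computing the trip count, and
 * the index has to be sign extended before it is used in an address.
 */
void
fill_int(DemoSlot *slot, int start, int natts, const uintptr_t *src)
{
	assert(start >= 0 && start <= natts);

	for (int attnum = start; attnum < natts; attnum++)
	{
		slot->values[attnum] = src[attnum];
		slot->isnull[attnum] = false;
	}
}

/*
 * size_t index. The index already has pointer width, so it can be used
 * directly for indexed addressing without any extension.
 */
void
fill_size_t(DemoSlot *slot, size_t start, size_t natts, const uintptr_t *src)
{
	assert(start <= natts);

	for (size_t attnum = start; attnum < natts; attnum++)
	{
		slot->values[attnum] = src[attnum];
		slot->isnull[attnum] = false;
	}
}
```

Both functions store the same values; only the type of the loop index
differs.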

Greetings,

Andres Freund