Re: More speedups for tuple deformation

Andres Freund Wed, 25 Feb 2026 12:29:23 -0800

Hi,

On 2026-02-25 13:05:14 -0500, Andres Freund wrote:
> At least gcc is doing some truly weird shit in the
> firstNonGuaranteed/firstNonCachedOffsetAttr loop "header" (i.e. just before
> the first entrance to the loop), which leads to the register pressure being
> high, which leads to spilling on the stack, making the few-tuples case slower:
>
> [ lots of stuff trimmed ]
> 
> I.e. the compiler creates an offset version of tts_values[tts_nvalid],
> tts_isnull[tts_nvalid], which then creates register allocation pressure,
> because later the original tts_values/tts_isnull etc are accessed again and
> thus the underlying registers are preserved.  And this is all for zero gain,
> from what I can tell, because the accesses are still done with indexed
> addressing (like mov %rdi,(%r12,%rcx,8)), which would work just as
> well if rcx were indexed based on attnum, not zero indexed within the loop.
> 
> I see about a 10% improvement if I dissuade the compiler from doing that by
> adding
>   __asm__ volatile ("" : "+r"(attnum) : :);
> 
> in the loop body.
> 
> 
> I'm getting to the point where I'd like to just hand write the assembler for
> this stupid function. Gah.


Huh.  It seems to be, at least partially, related to using a signed integer for
attnum et al. Due to us using -fwrapv, the compiler can't actually assume that
an attnum++ won't overflow. An overflow would make the loop trip counts a lot
more complicated.  Even with that I don't understand how it ends up
generating such crappy code, but since using size_t fixes it...
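
To make the comparison concrete, here is a minimal sketch of the three loop
shapes discussed above. It is not the actual deforming code: DemoSlot and the
demo_fill_* functions are hypothetical stand-ins for tts_values/tts_isnull and
the real loop, and the empty asm statement is the GCC/clang extension quoted
earlier. All three variants compute the same result. Only the generated code
may differ, and whether a loop this small reproduces the effect is an
assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for tts_values / tts_isnull; not the real slot struct. */
typedef struct DemoSlot
{
	uintptr_t  *values;
	bool	   *isnull;
} DemoSlot;

/*
 * Signed int counter. Under -fwrapv, attnum++ is defined to wrap around, so
 * the compiler cannot assume the increment never overflows.
 */
static void
demo_fill_int(DemoSlot *slot, const uintptr_t *src, int start, int natts)
{
	for (int attnum = start; attnum < natts; attnum++)
	{
		slot->values[attnum] = src[attnum];
		slot->isnull[attnum] = false;
	}
}

/* Same loop with a size_t counter, the change reported above to help. */
static void
demo_fill_size_t(DemoSlot *slot, const uintptr_t *src, size_t start, size_t natts)
{
	for (size_t attnum = start; attnum < natts; attnum++)
	{
		slot->values[attnum] = src[attnum];
		slot->isnull[attnum] = false;
	}
}

/*
 * Same int loop, plus the empty asm from the quoted mail. The "+r" constraint
 * tells the compiler attnum may be modified here, which discourages it from
 * rewriting the loop around a separate zero-based counter.
 */
static void
demo_fill_barrier(DemoSlot *slot, const uintptr_t *src, int start, int natts)
{
	for (int attnum = start; attnum < natts; attnum++)
	{
		__asm__ volatile ("" : "+r"(attnum) : :);
		slot->values[attnum] = src[attnum];
		slot->isnull[attnum] = false;
	}
}
```

Compiling this with something like gcc -O2 -fwrapv -S, and again without
-fwrapv, and comparing the code just before each loop is one way to check
whether the compiler materializes offset copies of the array pointers.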

Greetings,

Andres Freund
