Hi,
On 2026-02-25 13:59:53 +1300, David Rowley wrote:
> On Wed, 25 Feb 2026 at 03:39, Andres Freund <[email protected]> wrote:
> > ISTM we should just merge 0004. In my testing it's a very clear win,
> > without,
> > afaict, any downsides.
>
> I'd like to get them in in sequence as I believe 0004 buys back some
> extra overheads such as the Min()s in slot_deform_heap_tuple(). If I
> were to do 0004 first, then wait a while, it might look more like I'm
> introducing a small regression.
I'd rather get more stuff merged out of the way. We can handle a small
regression by comparing to 18 instead. I think we tend to be too worried
about intra-master regressions, and not worried enough about inter-branch
regressions.
But FWIW, I don't actually see any regressions due to the Min() on my hardware
as the patches stand.
I do actually see a small regression in some cases due to 0004, but that's
easy to fix:
    /* Fetch any missing attrs and raise an error if reqnatts is invalid */
    if (unlikely(attnum < reqnatts))
        slot_getmissingattrs(slot, attnum, reqnatts);

done:
    /* Save current offset for next execution */
    *offp = off;
This forces pushing a bunch of stack state before the call to
slot_getmissingattrs, doing the call, returning, and restoring state from the
stack, just to then jump to the end of the function to set *offp, which again
does a bunch of stack restoring before returning.
If you instead make it something like:
    /* Fetch any missing attrs and raise an error if reqnatts is invalid */
    if (unlikely(attnum < reqnatts))
    {
        *offp = off;
        slot_getmissingattrs(slot, attnum, reqnatts);
        return;
    }
(or just duplicate the *offp = off in the goto done; case)
the overhead seems to be removed.
At least gcc is doing some truly weird shit in the
firstNonGuaranteed/firstNonCachedOffsetAttr loop "header" (i.e. just before
the first entrance to the loop), which leads to high register pressure, which
in turn leads to spilling onto the stack, making the few-tuples case slower:
    r9          = tts_nvalid
    r11         = firstNonGuaranteedAttr
    r15         = slot->tts_tupleDescriptor
    rbx         = tts_values
    -0x10(%rsp) = tts_isnull
    rax         = tup
    if (attnum < firstNonGuaranteedAttr)
        cmp    %r9d,%r11d
        jle    <tts_buffer_heap_getsomeattrs+0x3c0>

        lea    0x0(,%r9,8),%r12        # multiply r9/tts_nvalid by 8, store in r12
        sub    %r9d,%r11d              # r11 = firstNonGuaranteedAttr - tts_nvalid
        xor    %ecx,%ecx               # set ecx to 0
        lea    0x20(%r15,%r12,1),%rdx  # compute (char *) tupleDesc->compact_attrs + r12,
                                       # i.e. the address of the first needed
                                       # compact attribute
        add    %rbx,%r12               # compute (char *) tts_values + tts_nvalid * sizeof(Datum)
        mov    -0x10(%rsp),%rbx        # restore tts_isnull from stack
        lea    (%rbx,%r9,1),%rsi       # compute &tts_isnull[tts_nvalid]
        jmp    <tts_buffer_heap_getsomeattrs+0xee>
Then the loop body:

    ...
    movb   $0x0,(%rsi,%rcx,1)          # set (tts_isnull + tts_nvalid)[i] = 0
    mov    %rdx,%r13
    movswl (%rdx),%edi                 # load cattr->attcacheoff into rdi/edi
    movzwl 0x2(%rdx),%r10d             # load attlen into r10
    mov    %rdi,%rbx
    add    %rax,%rdi                   # rdi = tup + attcacheoff
    [bunch of attlen related branches omitted]
    movslq (%rdi),%rdi                 # load *(tup + off) into rdi
    mov    %rdi,(%r12,%rcx,8)          # store rdi in tts_values[i]
    lea    0x1(%rcx),%rdi              # increment i by one, in a weird way, store in rdi
    add    $0x8,%rdx                   # increment cattr by 8
    cmp    %rdi,%r11                   # check if i < (firstNonGuaranteedAttr - tts_nvalid)
I.e. the compiler creates an offset version of tts_values[tts_nvalid] and
tts_isnull[tts_nvalid], which then creates register allocation pressure,
because later the original tts_values/tts_isnull etc. are accessed again and
thus the underlying registers have to be preserved. And this is all for zero
gain, from what I can tell, because the accesses are still done with indexed
addressing (like mov %rdi,(%r12,%rcx,8)), which would work just as well if
rcx were indexed based on attnum rather than zero-indexed within the loop.
I see about a 10% improvement if I dissuade the compiler from doing that by
adding

    __asm__ volatile ("" : "+r"(attnum) : :);

in the loop body.
I'm getting to the point where I'd like to just hand write the assembler for
this stupid function. Gah.
> > > I'm not getting great results from benchmarking the 0005 patch. I
> > > verified that gcc does access the array without calculating the
> > > element address from scratch each time and calculates it once, then
> > > increments the pointer by sizeof(CompactAttribute). See the attached
> > > .csv for the results on the 3 machines I tested on.
> >
> > FWIW, where I had seen that be rather beneficial is the
> > TupleDescCompactAttr()
> at the start of the various loops, where the compiler has little choice but to
> > compute the address of the tupdesc->compact_attrs[firstNeededCol]. That
> > matters only when only deforming a small number of columns, of course.
>
> oh ok. I wasn't aware that LEA's scaling factor can only be 1, 2, 4 or
> 8. With the 8-byte struct, the compiler should be able to do the shift
> and add as one operation, whereas with the 16-byte struct would
> require a separate shift and add.
>
> Looking at the generated code, with 0004, I see:
>
> 1c79: 48 c1 e2 04 shl rdx,0x4
> 1c7d: 48 8d 4c 15 20 lea rcx,[rbp+rdx*1+0x20]
>
> whereas with 0005 I see:
>
> 1c6b: 4a 8d 1c dd 00 00 00 lea rbx,[r11*8+0x0]
>
> Is that what you meant?
Yep.
> > > I've also resequenced the patches to make the deform_bench test module
> > > part of the 0001 patch. This makes it easier to test the performance
> > > of master.
> >
> > What are your thoughts about merging the deform_bench tooling? I wonder if
> > we
> > should have src/test/modules/benchmark_tools or such, so we can add a few
> > more
> > micro-benchmarky tools over time?
>
> I'd like to see us give these tools a proper home. It helps lower the
> bar for anyone else who'd like to experiment at some future date, and
> also allows people to more easily test for performance regressions if
> they're forced to change related code. I've also got a tool that
> benchmarks the MemoryContext code which I keep in some local repo that
> I dig out from time to time. Given that, it's probably unlikely
> deform_bench would be the only extension in there if we did make a
> directory for these.
I'd probably lean towards one extension with different functions for
different purposes, but those are boring details.
> On the other hand, it does add some maintenance overhead, but IMO,
> helping to ensure various key routines are optimal is a worthy enough
> cause to make the maintenance overhead worthwhile.
Agreed.
Greetings,
Andres Freund