Re: Stack-based tracking of per-node WAL/buffer usage

Lukas Fittl Wed, 18 Mar 2026 16:36:57 -0700

On Wed, Mar 18, 2026 at 1:49 PM Zsolt Parragi <[email protected]> wrote:
>
> + instr_stack.stack_space *= 2;
> + instr_stack.entries = repalloc_array(instr_stack.entries,
> Instrumentation *, instr_stack.stack_space);
>
> Can't this also cause issues with OOM? repalloc_array failing, but we
> already doubled stack_space.
> The initialization above uses the same order, but that should be safe
> as entries is initially NULL.


Yeah, that's fair, I think its reasonable to do the update of
instr_stack.stack_space after the allocation succeeded. I also think
its good to do that even for the initialization case - it seems
extremely unlikely, but the fact that entries is NULL doesn't protect
us against a subsequent InstrPushStack running after an OOM and the
check whether InstrStackGrow should be called runs based on the stack
space, not the entries array.

I have a fix for that staged for the next iteration.

> + * 2) Accumulate all instrumentation to the currently active instrumentation,
> + *    so that callers get a complete picture of activity, even after an abort
> ...
> + if (idx >= 0)
> + {
> + while (instr_stack.stack_size > idx + 1)
> + instr_stack.stack_size--;
> +
> + InstrPopStack(instr);
> + }
>
> There seems to be one more bug in this:
>
> 1. EXPLAIN ANALYZE fires a trigger
> 2. The trigger function throws ERROR, InstrStopTrigger never runs
> 3. ResOwnerReleaseInstrumentation runs but only checks
> unfinalized_children, not triggers
> 4. InstrStopFinalize discards the trigger entry
> 5. Trigger instrumentation information shows 0

Hmm, so I think you're correct that a trigger function error would
cause any stack-based instrumentation from the trigger to get lost.

In practice that doesn't matter today, since triggers never capture
WAL/buffer usage data (only timing), but its maybe a design flaw
because trigger instrumentation is its own thing without a defined
relationship with the stack, unlike NodeInstrumentation which is
registered to the query that handles the aborts.

In a sense this is a similar situation to the EXPLAIN (SERIALIZE)
per-tuple handling we talked about previously - we have
instrumentation that's related to a query, but its not per-node, so
using NodeInstrumentation doesn't really make sense.

I could imagine three ways forward, if we want to address that now (vs
documenting that this isn't handled but effectively not a problem):

1) We add a second list for unfinalized trigger instrumentations on
QueryInstrumentation, and accumulate that too in
ResOwnerReleaseInstrumentation

2) We call InstrStopFinalize in ExecCallTriggerFunc if an error was
thrown from the trigger function (i.e. turn the existing PG_FINALLY
block into a PG_CATCH)

3) We generalize the QueryInstrumentation children handling to allow
other types of Instrumentation (i.e. not just NodeInstrumentation)

I think (3) would be ideal and would let us deal with EXPLAIN
(SERIALIZE) too, but is complicated by the fact that
ResOwnerReleaseInstrumentation needs to have reference to the full
allocated struct (not just the Instrumentation contained within) so it
can call pfree to avoid leaking memory until top transaction end.

In a prior iteration we had the Instrumentation allocated separately
inside NodeInstrumentation (so its a pointer and can thus be freed
independently / replaced with a copy on clean exit), which allows
ResOwnerReleaseInstrumentation to just deal with Instrumentation, but
that becomes inconvenient when dealing with parallel workers.

There is an alternate design I had considered, which is to basically
keep two copies of Instrumentation in NodeInstrumentation: One is a
pointer to the running instrumentation (allocated in a memory context
that survives long enough in an abort, and which will be freed upon
abort or clean exit), and one is a direct member of the containing
struct (like we have it today), and gets updated via a memcpy() upon a
clean exit. I think that'd make the API easier to use and the same
concept could then be applied to TriggerInstrumentation, but the big
downside is that we'd be doubling memory usage because whilst we're
running we'd both have the allocation in the higher memory context,
and the direct member of the containing struct (to return the result
to the caller).

Because of that, I feel like we should do (1) or (2) for now - but
I'll also wait if Andres or others have additional feedback on 0005
before proceeding with further changes.

I also do think that the 0001-0004 patches are good to be committed
unless anyone had additional feedback there.

Thanks,
Lukas

--
Lukas Fittl

Re: Stack-based tracking of per-node WAL/buffer usage

Reply via email to