Hi Heikki,
On Wed, Mar 25, 2026 at 3:47 AM Heikki Linnakangas <[email protected]> wrote:
>
> On 24/03/2026 08:03, Lukas Fittl wrote:
> > Instead I've tried introducing a memory context for instrumentation
> > managed as a resource owner, and I am now (for now) convinced that
> > this is the right trade-off for the problem at hand.
>
> Yes, that seems better.
Thanks for reviewing!
> This patch could use an overview README file, I'm struggling to
> understand how the this all works. Here's my understanding so far,
> please correct me if I'm wrong:
Sure, happy to put this together - I wonder where would place that
best - probably src/backend/executor/README.instrument ?
> There are *two* data structures tracking the Instrumentation nodes. The
> patch only talks about a stack, but I think there's also implicitly a
> tree in there.
>
> Tree
> ----
>
> All Instrumentation nodes are part of a tree. For example, if you have
> two portals open, the tree might look like this:
>
> Session - Query A - NestLoop - Seq Scan A
> - Seq Scan B
>
> - Query B - Seq Scan C
>
> When a node is "finalized", its counters are added to its parent.
>
> This tree is a somewhat implicit in the patch. Each QueryInstrumentation
> has a list of child nodes, but only unfinalized ones. Don't we need that
> at the session level too? When a Query is released on abort, its
> counters need to be added to the parent too. If I understand correctly,
> the patch tries to use the stack for that, but it's confusing.
If I follow you correctly, we're talking about the work that
InstrStopFinalize is doing (both in regular flow, and in abort):
void
InstrStopFinalize(Instrumentation *instr)
{
...
InstrAccumStack(instr_stack.current, instr);
}
The "instr_stack.current" global referenced here is effectively the
instrumentation that was active before InstrStart was called, and
would be a parent in the tree in that sense.
Its worth noting that on abort we don't care about the tree structure
below the aborted activity, i.e. each QueryInstrumentation acts as a
finalization point of sorts, with any tree structure below (e.g. that
of executor nodes) not being finalized to their respective parents,
but instead just getting added to the QueryInstrumentation they were
attached to (which then gets finalized into "instr_stack.current").
> I think it would make the patch more clear to talk explicitly about the
> tree, and represent it explicitly in the Instrumentation nodes. I.e. add
> a "parent" pointer, or a "children" list, or both to the Instrumentation
> struct.
I'm happy to clarify the mechanism, but I'm hesitant to add more
pointers to Instrumentation, since its a base struct that gets re-used
in different places, and also gets copied to parallel workers (so any
pointer requires extra scrutiny to avoid mis-use) - and I don't think
we actually need to track the parent pointer, since in practice it
will always be the current stack entry.
>
>
> Stack
> -----
>
> At all times, there's a stack that tracks what is the Instrumentation in
> the tree that is *currently* executing. For example, while executing the
> Seq Scan B, the stack would look like this:
>
> 0: Session
> 1: Query A
> 2: NestLoop
> 3: Seq Scan B
>
> And when the code is sending a result row back to the client, while the
> query is being executed, the stack would be just:
>
> 0: Session
>
>
> In the patch, the stack is represented by an array. It could also be
> implemented with a CurrentInstrumentation global variable, similar to
> CurrentMemoryContext and CurrentResourceOwner.
It could be, and in fact earlier iterations were closer to that, but I
modified that to the current version based on Andres' feedback - The
array structure is a lot easier to work with during abort when things
execute out-of-order (as you also note later).
>
>
> Abort handling
> --------------
>
> On abort, two things need to happen:
>
> 1. Reset the stack to the appropriate level. This ensures that any we
> don't later try to update the counters on an Instrumentation nodes that
> is going away with the abort. In the above example, the stack would be
> reset to the 0: Session level.
Correct, but just to clarify, the main problem we deal with in terms
of reset is making sure that we cancel out any InstrPushStack that was
done, but will now no longer have a matching InstrPopStack getting
called.
We need to make sure that we get to the stack entry that was active
before the aborted activity started.
> 2. Finalize all the Instrumentation nodes as part of the ResourceOwner
> cleanup. All Instrumentation nodes that are released roll up their
> counters to their parents.
>
>
> Questions:
>
> Is the stack always a path from the root of the tree, down to some node?
> Or could you have e.g. recursion like A -> B -> C -> A? (I don't know if
> it makes a difference, just wondering)
I don't think this happens in practice - I think for the stack itself
that'd probably be fine (since you're just putting the entry back on
that was on before, in a sense), but it'd e.g. result in timer
instrumentation behaving incorrectly.
We could probably add an assert to explicitly prevent that if we're
worried about it, but the existing Start/Stop instrumentation calls
haven't seen this issue I think, and they'd already have had a problem
with that.
> What happens if you release e.g. the NestLoop before its children? All
> the Instrumentation nodes belonging to a query would usually be part of
> the same ResourceOwner and there's no guarantee on what order the
> resources are released.
Correct, and in fact prior versions of the patch struggled with that
exact problem. Its both an issue for resource owner managed cleanup,
and when you have PG_FINALLY in the picture (e.g. pg_stat_statements).
But that's exactly why its not really a full tree - in the abort case
we do not care about the relationship of child instrumentations
underneath the QueryInstrumentation - we just make sure that the stack
is reset to the entry that was active before the QueryInstrumentation
started, and that all activity that occurred is added to the
QueryInstrumentation.
If you had a situation that had two QueryInstrumentations active (i.e.
both registered as resource owner), we go up to whichever one of the
two is higher up in the stack, per logic in InstrStopFinalize.
Thanks for thinking this through & hopefully this clarifies things a bit?
Thanks,
Lukas
--
Lukas Fittl