Hi Tom, On Tue, May 20, 2025 at 12:06 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > My attention was drawn to commit 525392d57 after observing that > Valgrind complained about a memory leak in some code that commit added > to BuildCachedPlan(). I tried to make sense of said code so I could > remove the leak, and eventually arrived at the attached patch, which > is part of a series of leak-fixing things hence the high sequence > number. > > Unfortunately, the bad things I speculated about in the added comments > seem to be reality. The second attached file is a test case that > triggers > > TRAP: failed Assert("list_length(plan_list) == > list_length(plan->stmt_list)"), File: "plancache.c", Line: 1259, PID: 602087 > > because it adds a DO ALSO rule that causes the rewriter to generate > more PlannedStmts than it did before. > > This is quite awful, because it does more than simply break the klugy > (and undocumented) business about keeping the top-level List in a > different context. What it means is that any outside code that is > busy iterating that List is very fundamentally broken: it's not clear > what List index it ought to resume at, except that "the one it was at" > is demonstrably incorrect. > > I also don't really believe the (also undocumented) assumption that > such outside code is in between executions of PlannedStmts of the > List and hence can tolerate those being ripped out and replaced. > I have not attempted to build an example, because the one I have > seems sufficiently damning. But I bet that a recursive function > could be constructed in such a way that an outer execution is > still in progress when an inner call triggers UpdateCachedPlan. > > Another small problem (much more easily fixable than the above, > probably) is that summarily setting "plan->is_valid = true" > at the end is not okay. We could already have received an > invalidation that should result in marking the plan stale. > (Holding locks on the tables involved is not sufficient to > prevent that, as there are other sources of inval events.)
Thanks for pointing out the hole in the current handling of CachedPlan->stmt_list. You're right that the approach of preserving the list structure while replacing its contents in-place doesn’t hold up when the rewriter adds or removes statements dynamically. There might be other cases that neither of us have tried. I don’t think that mechanism is salvageable. To address the issue without needing a full revert, I’m considering dropping UpdateCachedPlan() and removing the associated MemoryContext dance to preserve CachedPlan->stmt_list structure. Instead, the executor would replan the necessary query into a transient list of PlannedStmts, leaving the original CachedPlan untouched. That avoids mutating shared plan state during execution and still enables deferred locking in the vast majority of cases. There are two variants of this approach. In the simpler form, the transient PlannedStmt list exists only in executor-local memory and isn’t registered with the invalidation machinery. That might be acceptable in practice, since all referenced relations are locked at that point -- but it would mean any invalidation events delivered during execution are ignored. The more robust variant is to build a one-query standalone CachedPlan using something like GetTransientCachedPlanForQuery(), which I had proposed back in [1]. This gets added to a standalone_plan_list so that invalidation callbacks can still reach it. I dropped that design earlier [2] due to the cleanup overhead, but I’d be happy to bring it back in a simplified form if that seems preferable. One open question in either case is what to do if the number of PlannedStmts in the rewritten plan changes as with your example. Would it be reasonable to just go ahead and execute the additional statements from the transient plan, even though the original CachedPlan wouldn’t have known about them until the next use? That would avoid introducing any new failure behavior while still handling the invalidation correctly for the current execution. > It's possible that this code can be fixed, but I fear it's > going to involve some really fundamental redesign, which > probably shouldn't be happening after beta1. I think there > is no alternative but to revert for v18. ...Beyond that, I think I’ve run out of clean options for making deferred locking executor-local while keeping invalidation safe. I know you'd previously objected (with good reason) to making GetCachedPlan() itself run pruning logic to determine which partitions to lock -- and to the idea of carrying or sharing the result of that pruning back to the executor via interface changes in the path from plancache.c through its callers down to ExecutorStart(). So I’ve steered away from revisiting that direction. But if we’re not comfortable with either of the transient replanning options, then we may end up shelving the deferred locking idea entirely -- which would be unfortunate, given how much it helps workloads that rely on generic plans over large partitioned tables. Let me know what you think -- I’ll hold off on posting a revert or a replacement until we’ve agreed on the path forward. -- Thanks, Amit Langote [1] https://www.postgresql.org/message-id/CA%2BHiwqGSOge3eT3kcm_nxCSA3Ut%2Bd0jtchi8g8J9uXi-kyC7Jw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CA%2BHiwqHRRFQN6yZ54fBydOTM6ncqZBCmewZ6n519RjRdDsO44g%40mail.gmail.com