Hi Robert,

Thanks for the pg_plan_advice patches!  I've tried hard to attack them, to
get
them to fail in some catastrophic way.  You seem to have hardened the code
well.  I found only one concern for you to consider, a kind of memory leak
ratchet:

Once the system reaches memory pressure where:
  - The 8192-byte DSA size class is exhausted (needs a new DSM segment, OS
refuses)
  - Smaller size classes still have free space in existing superblocks

Then every single query that triggers advice collection will:
  1. Successfully allocate an advice entry from existing free space
  2. Enter store_shared_advice, hit the same chunk boundary
  3. Fail to allocate the chunk
  4. Leak the advice entry
  5. Reduce remaining free space in the small size classes

This continues until the small size classes are also exhausted, at which
point
make_collected_advice itself fails (the DSA area has been consumed by leaked
entries). The ratchet is self-reinforcing: each failure guarantees the next
failure while consuming more resources, assuming nobody else is freeing
memory simultaneously.

I looked for situations where something inside postgres would keep retrying
after the OOM, but the most likely I think is just an application that
treats
OOM as a transient error and keeps retrying.

See the make_collected_advice() call in pg_collect_advice_save(); the point
where DSA memory is allocated but not yet linked into any data structure.
Everything downstream from here (the four dsa_allocate0 calls inside
store_shared_advice) can fail and leak it.

On Thu, Mar 26, 2026 at 10:20 AM Robert Haas <[email protected]> wrote:

> On Thu, Mar 26, 2026 at 9:55 AM Robert Haas <[email protected]> wrote:
> > Whoops. Obviously got the wrong thing stuck in my cut and paste buffer
> > when I was writing that. Thanks for checking it. I'm going to go ahead
> > and commit this, because I'm pretty confident that it's correct, and
> > the rest of these patches are not going to fix the buildfarm
> > instability without it, and I'm pretty sure multiple committers are
> > pretty tired of seeing these test_plan_advice failures already.
>
> Done.
>
> > Right, the comment isn't quite correct. I don't think your rewording
> > is quite right either, though, because there's really no reason to
> > mention plan_name here at all. I'll adjust it.
>
> Done and committed, after also adjusting the memory context handling
> to avoid re-breaking GEQO.
>
> > The dangling pointers are a good point; I agree that's bad. However,
> > I'd be more inclined to fix it by nulling out the alternative_root
> > pointers at the end of set_plan_references. I think that would just be
> > the case where root->isAltSubplan[ndx] && root->isUsedSubplan[ndx].
> > The reason I'm reluctant to just store the name is that there's not an
> > easy way to find a PlannerInfo by name. I originally proposed an
> > "allroots" list in PlannerGlobal, but we went with subplanNames on
> > Tom's suggestion. I subsequently realized that this kind of stinks for
> > code that is trying to use this infrastructure for anything, for
> > exactly this reason, but Tom never responded and I never pressed the
> > issue. But I think we're boxing ourselves into a corner if we just
> > keep storing names that can't be looked up everywhere. It doesn't
> > matter for the issue before us, so maybe doing as you say here is the
> > right idea just so we can move forward, but I think we're probably
> > kidding ourselves a little bit.
>
> Here's a new version, where I've replaced alternative_root by
> alternative_plan_name, serving the same function.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


-- 

*Mark Dilger*

Reply via email to