Re: Fix assertion in autovacuum worker

Andres Freund Tue, 28 Nov 2023 18:49:21 -0800

Hi,

On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote:
> On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote:
> > On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote:
> >> From a glance, it looks to me like the problem is that pgstat_shutdown_hook
> >> is registered as a before_shmem_exit callback, while ProcKill is registered
> >> as an on_shmem_exit callback.
> > 
> > That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which 
> > you
> > can't after ProcKill(). It's also not unique to pgstat, several other
> > before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(),
> > ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of
> > before_shmem_exit when git grepping) - which makes sense, they are 
> > presumably
> > before_shmem_exit() because they need to manage shared state, which often
> > needs locks.
> > 
> > In normal backends this is fine-ish, because ShutdownPostgres() is 
> > registered
> > very late (and thus is called early in the shutdown sequence), and the
> > AbortOutOfAnyTransaction() contained therein indirectly calls
> > LWLockReleaseAll() and very little happens outside of the transaction 
> > context.
> 
> Right.  Perhaps we could add a LWLockReleaseAll() to
> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
> is still just a hack.


Yea, we'd need that in just about all before_shmem_exit() callbacks.  I could
see an argument for doing it in proc_exit_prepare(). While that'd be a fairly
gross layering violation, we already do reset a number a bunch of stuff in
there:
        /*
         * Forget any pending cancel or die requests; we're doing our best to
         * close up shop already.  Note that the signal handlers will not set
         * these flags again, now that proc_exit_inprogress is set.
         */
        InterruptPending = false;
        ProcDiePending = false;
        QueryCancelPending = false;
        InterruptHoldoffCount = 1;
        CritSectionCount = 0;


> >> I would expect your patch to fix this particular issue, but I'm wondering
> >> whether there's a bigger problem here.
> > 
> > Yes, there is - our subsystem initialization, shutdown, error recovery
> > infrastructure is a mess.  We've interwoven transaction handling far too
> > tightly with error handling, the order of subystem initialization is 
> > basically
> > random and differs between operating systems (due to EXEC_BACKEND) and 
> > "mode"
> > of execution (the order differs when using single user mode) and we've
> > distributed error recovery into ~10 places (all the sigsetjmp()s in backend
> > code, xact.c and and a few other places like WalSndErrorCleanup()).
> 
> :(
> 
> I do remember looking into uniting all the various sigsetjmp() calls
> before.  That could be worth another try.  The rest will probably require
> additional thought...

It'd definitely be worth some effort. I'm quite sure that we have a number of
hard to find bugs around this.

Greetings,

Andres Freund

Re: Fix assertion in autovacuum worker

Reply via email to