Re: Unstable tests for recovery conflict handling

Andres Freund Wed, 03 Aug 2022 15:08:17 -0700

Hi,

On 2022-08-03 16:33:46 -0400, Robert Haas wrote:
> On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <and...@anarazel.de> wrote:
> > The reason nothing might get logged in some cases is that
> > e.g. ResolveRecoveryConflictWithLock() tells
> > ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting:
> >                 /*
> >                  * Prevent ResolveRecoveryConflictWithVirtualXIDs() from 
> > reporting
> >                  * "waiting" in PS display by disabling its argument 
> > report_waiting
> >                  * because the caller, WaitOnLock(), has already reported 
> > that.
> >                  */
> >
> > so ResolveRecoveryConflictWithLock() can end up looping indefinitely without
> > logging anything.
> 
> I understand why we need to avoid adding "waiting" to the PS status
> when we've already done that, but it doesn't seem like that should
> imply skipping ereport() of log messages.
> 
> I think we could redesign the way the ps display works to make things
> a whole lot simpler. Let's have a function set_ps_display() and
> another function set_ps_display_suffix(). What gets reported to the OS
> is the concatenation of the two. Calling set_ps_display() implicitly
> resets the suffix to empty.
> 
> AFAICS, that'd let us get rid of this tricky logic, and some other
> tricky logic as well. Here, we'd just say set_ps_display_suffix("
> waiting") and not worry about whether the caller might have already
> done something similar.


That sounds like it'd be an improvement.  Of course we still need to fix that
we can signal at a rate not allowing the other side to handle the conflict,
but at least that'd be easier to identify...


> > Another question I have about ResolveRecoveryConflictWithLock() is whether
> > it's ok that we don't check deadlocks around the
> > ResolveRecoveryConflictWithVirtualXIDs() call? It might be ok, because we'd
> > only block if there's a recovery conflict, in which killing the process 
> > ought
> > to succeed?
> 
> The startup process is supposed to always "win" in any deadlock
> situation, so I'm not sure what you think is a problem here. We get
> the conflicting lockers. We kill them. If they don't die, that's a
> bug, but killing ourselves doesn't really help anything; if we die,
> the whole system goes down, which seems undesirable.

The way deadlock timeout for the startup process works is that we wait for it
to pass and then send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK to the
backends. So it's not that the startup process would die.

The question is basically whether there are cases were
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK would resolve a conflict but
PROCSIG_RECOVERY_CONFLICT_LOCK wouldn't. It seems plausible that there isn't,
but it's also not obvious enough that I'd fully trust it.

Greetings,

Andres Freund

Re: Unstable tests for recovery conflict handling

Reply via email to