Hi, On 2022-08-03 16:33:46 -0400, Robert Haas wrote: > On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <and...@anarazel.de> wrote: > > The reason nothing might get logged in some cases is that > > e.g. ResolveRecoveryConflictWithLock() tells > > ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting: > > /* > > * Prevent ResolveRecoveryConflictWithVirtualXIDs() from > > reporting > > * "waiting" in PS display by disabling its argument > > report_waiting > > * because the caller, WaitOnLock(), has already reported > > that. > > */ > > > > so ResolveRecoveryConflictWithLock() can end up looping indefinitely without > > logging anything. > > I understand why we need to avoid adding "waiting" to the PS status > when we've already done that, but it doesn't seem like that should > imply skipping ereport() of log messages. > > I think we could redesign the way the ps display works to make things > a whole lot simpler. Let's have a function set_ps_display() and > another function set_ps_display_suffix(). What gets reported to the OS > is the concatenation of the two. Calling set_ps_display() implicitly > resets the suffix to empty. > > AFAICS, that'd let us get rid of this tricky logic, and some other > tricky logic as well. Here, we'd just say set_ps_display_suffix(" > waiting") and not worry about whether the caller might have already > done something similar.
That sounds like it'd be an improvement. Of course we still need to fix that we can signal at a rate not allowing the other side to handle the conflict, but at least that'd be easier to identify... > > Another question I have about ResolveRecoveryConflictWithLock() is whether > > it's ok that we don't check deadlocks around the > > ResolveRecoveryConflictWithVirtualXIDs() call? It might be ok, because we'd > > only block if there's a recovery conflict, in which killing the process > > ought > > to succeed? > > The startup process is supposed to always "win" in any deadlock > situation, so I'm not sure what you think is a problem here. We get > the conflicting lockers. We kill them. If they don't die, that's a > bug, but killing ourselves doesn't really help anything; if we die, > the whole system goes down, which seems undesirable. The way deadlock timeout for the startup process works is that we wait for it to pass and then send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK to the backends. So it's not that the startup process would die. The question is basically whether there are cases were PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK would resolve a conflict but PROCSIG_RECOVERY_CONFLICT_LOCK wouldn't. It seems plausible that there isn't, but it's also not obvious enough that I'd fully trust it. Greetings, Andres Freund