On 2020-May-16, Andres Freund wrote:

> I, independent of this patch, added a few additional paths in which
> checkpointer's latch is reset, and I found a few shutdowns in regression
> tests to be extremely slow / timing out. The reason for that is that
> the only check for interrupts is at the top of the loop. So if
> checkpointer gets SIGUSR2 we don't see ShutdownRequestPending until we
> decide to do a checkpoint for other reasons.

Ah, yeah, this seems a genuine bug.

> I also suspect that it could have harmful consequences to not do a
> AbsorbSyncRequests() if something "ate" the set latch.

I traced through this when looking over the previous fix, and given that
checkpoint execution itself calls AbsorbSyncRequests frequently, I don't
think this one qualifies as a bug.

> I don't think it's reasonable to expect this much code between a
> ResetLatch and WaitLatch to never reset a latch. So I think we need to
> make the coding more robust in face of that. Without having to duplicate
> the top and the bottom of the loop.

That makes sense to me.

> One way to do that would be to WaitLatch() call to much earlier, and
> only do a WaitLatch() if do_checkpoint is false. Roughly like in the
> attached.

Hm.  I'd do "WaitLatch() / continue" in the "!do_checkpoint" block, and
keep the checkpoint code out of the else block; that seems easier to read
to me.

While we're here, can we change CreateCheckPoint to return true so that
we can do

	ckpt_performed = do_restartpoint ? CreateRestartPoint(flags) : CreateCheckPoint(flags);

instead of the mess we have there now?  (Also add a comment that
CreateCheckPoint must not return false, to avoid messing with the
schedule.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
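
PS: to make the loop shape I have in mind concrete, here is a rough
standalone sketch.  It is not a patch against checkpointer.c: every
function and variable in it (run_loop, wait_latch_stub,
create_checkpoint_stub, create_restartpoint_stub, and the counters) is a
made-up stand-in so the thing compiles on its own, and the stub "latch"
simply pretends a shutdown request arrives on the second wait.  The only
points it tries to show are the "wait, then continue" arm, which sends us
back to the interrupt check at the top, and the single ckpt_performed
assignment.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical stand-ins, not PostgreSQL's functions. */
static bool shutdown_request_pending;
static int  checkpoints_done;
static int  waits_done;

/* Stand-in for CreateCheckPoint(): assumed to always return true. */
static bool create_checkpoint_stub(int flags)
{
    (void) flags;
    checkpoints_done++;
    return true;
}

/* Stand-in for CreateRestartPoint(): may report that nothing was done. */
static bool create_restartpoint_stub(int flags)
{
    (void) flags;
    return false;
}

/* Stand-in for WaitLatch(): the second wakeup carries a shutdown request. */
static void wait_latch_stub(void)
{
    if (++waits_done >= 2)
        shutdown_request_pending = true;
}

/*
 * Simplified main loop.  requests[i] says whether iteration i has a
 * checkpoint request pending.  Returns the number of iterations run.
 */
static int run_loop(const bool *requests, int nrequests, bool do_restartpoint)
{
    int iter = 0;

    assert(nrequests >= 0);
    shutdown_request_pending = false;
    checkpoints_done = 0;
    waits_done = 0;

    for (;;)
    {
        bool do_checkpoint;
        bool ckpt_performed;

        /* The only interrupt check, at the top of the loop. */
        if (shutdown_request_pending)
            break;

        do_checkpoint = (iter < nrequests) && requests[iter];
        iter++;

        if (!do_checkpoint)
        {
            /* Nothing to do: sleep, then go back up and re-check interrupts. */
            wait_latch_stub();
            continue;
        }

        /* Checkpoint code lives outside any else block. */
        ckpt_performed = do_restartpoint
            ? create_restartpoint_stub(0)
            : create_checkpoint_stub(0);
        (void) ckpt_performed;  /* the real code would use this for scheduling */
    }

    return iter;
}
```

Because every wakeup falls through "continue" to the top of the loop, a
shutdown request that arrives during the wait is noticed before any
further checkpoint work is considered.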