Re: O(n) tasks cause lengthy startups and checkpoints

Bharath Rupireddy Mon, 05 Dec 2022 23:28:57 -0800

On Sat, Dec 3, 2022 at 12:45 AM Nathan Bossart <nathandboss...@gmail.com> wrote:
>
> On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
> > On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandboss...@gmail.com> 
> > wrote:
> >> The test appears to reliably create snapshot and mapping files, so if the
> >> directories are empty at some point after the checkpoint at the end, we can
> >> be reasonably certain the custodian took action.  I didn't add explicit
> >> checks that there are files in the directories before the checkpoint
> >> because a concurrent checkpoint could make such checks unreliable.
> >
> > I think you're right. I added sqls to see if the snapshot and mapping
> > files count > 0, see [1] and the cirrus-ci members are happy too -
> > https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
> > think we can consider adding these count > 0 checks to tests.
>
> My worry about adding "count > 0" checks is that a concurrent checkpoint
> could make them unreliable.  In other words, those checks might ordinarily
> work, but if an automatic checkpoint causes the files be cleaned up just
> beforehand, they will fail.


Hm. It would have been better with a TAP test module for testing the
custodian code reliably. Anyway, that mustn't stop the patch getting
in. If required, we can park the TAP test module for later - IMO.
Others may have different thoughts here.

> > Having said above, I'm okay to process ShutdownRequestPending as early
> > as possible, however, should we also add CHECK_FOR_INTERRUPTS()
> > alongside ShutdownRequestPending?
>
> I'm not seeing a need for CHECK_FOR_INTERRUPTS.  Do you see one?

Since the custodian has SignalHandlerForShutdownRequest as SIGINT and
SIGTERM handlers, unlike StatementCancelHandler and die respectively,
no need of CFI I guess. And also none of the CFI signal handler flags
applies to the custodian.

> > While thinking about this, one thing that really struck me is what
> > happens if we let the custodian exit, say after processing
> > ShutdownRequestPending immediately or after a restart, leaving other
> > queued tasks? The custodian will never get to work on those tasks
> > unless the requestors (say checkpoint or some other process) requests
> > it to do so after restart. Maybe, we don't need to worry about it.
> > Maybe we need to worry about it. Maybe it's an overkill to save the
> > custodian's task state to disk so that it can come up and do the
> > leftover tasks upon restart.
>
> Yes, tasks will need to be retried when the server starts again.  The ones
> in this patch set should be requested again during the next checkpoint.
> Temporary file cleanup would always be requested during server start, so
> that should be handled as well.  Even today, the server might abruptly shut
> down while executing these tasks, and we don't have any special handling
> for that.

Right.

The v18 patch set posted upthread
https://www.postgresql.org/message-id/20221201214026.GA1799688%40nathanxps13
looks good to me. I see the CF entry is marked RfC -
https://commitfest.postgresql.org/41/3448/.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: O(n) tasks cause lengthy startups and checkpoints

Reply via email to