On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandboss...@gmail.com> wrote: > > >> 4. Is it a good idea to add log messages in the DoCustodianTasks() > >> loop? Maybe at a debug level? The log message can say the current task > >> the custodian is processing. And/Or setting the custodian's status on > >> the ps display is also a good idea IMO. > > I'd like to pick these up in a new thread if/when this initial patch set is > committed. The tasks already do some logging, and the checkpointer process > doesn't update the ps display for these tasks today.
It'll be good to have some kind of dedicated monitoring for the custodian process as it can do a "good" amount of work at times and users will have a way to know what it currently is doing - it can be logs at debug level, progress reporting via ereport_startup_progress()-sort of mechanism, ps display, pg_stat_custodian or a special function that tells some details, or some other. In any case, I agree to park this for later. > >> 0002 and 0003: > >> 1. > >> +CHECKPOINT; > >> +DO $$ > >> I think we need to ensure that there are some snapshot files before > >> the checkpoint. Otherwise, it may happen that the above test case > >> exits without the custodian process doing anything. > >> > >> 2. I think the best way to test the custodian process code is by > >> adding a TAP test module to see actually the custodian process kicks > >> in. Perhaps, add elog(DEBUGX,...) messages to various custodian > >> process functions and see if we see the logs in server logs. > > The test appears to reliably create snapshot and mapping files, so if the > directories are empty at some point after the checkpoint at the end, we can > be reasonably certain the custodian took action. I didn't add explicit > checks that there are files in the directories before the checkpoint > because a concurrent checkpoint could make such checks unreliable. I think you're right. I added sqls to see if the snapshot and mapping files count > 0, see [1] and the cirrus-ci members are happy too - https://github.com/BRupireddy/postgres/tree/custodian_review_2. I think we can consider adding these count > 0 checks to tests. > >> 0004: > >> I think the 0004 patch can be merged into 0001, 0002 and 0003 patches. > >> Otherwise the patch LGTM. > > I'm keeping this one separate because I've received conflicting feedback > about the idea. If we classify custodian as a process doing non-critical tasks that have nothing to do with regular server functioning, then processing ShutdownRequestPending looks okay. However, delaying these non-critical tasks such as file removals which reclaims disk space might impact the server overall especially when it's reaching 100% disk usage and we want the custodian to do its job fully before we shutdown the server. If we delay processing shutdown requests, that can impact the server overall (might delay restarts, failovers etc.), because at times there can be a lot of tasks with a good amount of work pending in the custodian's task queue. Having said above, I'm okay to process ShutdownRequestPending as early as possible, however, should we also add CHECK_FOR_INTERRUPTS() alongside ShutdownRequestPending? Also, I think it's enough to just have ShutdownRequestPending check in DoCustodianTasks(void)'s main loop and we can let RemoveOldSerializedSnapshots() and RemoveOldLogicalRewriteMappings() do their jobs to the fullest as they do today. While thinking about this, one thing that really struck me is what happens if we let the custodian exit, say after processing ShutdownRequestPending immediately or after a restart, leaving other queued tasks? The custodian will never get to work on those tasks unless the requestors (say checkpoint or some other process) requests it to do so after restart. Maybe, we don't need to worry about it. Maybe we need to worry about it. Maybe it's an overkill to save the custodian's task state to disk so that it can come up and do the leftover tasks upon restart. > > Another comment: > > IIUC, there's no custodian_delay GUC as we want to avoid unnecessary > > wakeups for power savings (being discussed in the other thread). > > However, can it happen that the custodian missed to capture SetLatch > > wakeups by other backends? In other words, can the custodian process > > be sleeping when there's work to do? > > I'm not aware of any way this could happen, but if there is one, I think we > should treat it as a bug instead of relying on the custodian process to > periodically wake up and check for work to do. One possible scenario is that the requestor adds its task details to the queue and sets the latch, the custodian can miss this SetLatch() when it's in the midst of processing a task. However, it guarantees the requester that it'll process the added task after it completes the current task. And, I don't know the other reasons when the custodian can miss SetLatch(). [1] diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out index 214a514a0a..0029e48852 100644 --- a/contrib/test_decoding/expected/rewrite.out +++ b/contrib/test_decoding/expected/rewrite.out @@ -163,6 +163,20 @@ DROP FUNCTION iamalongfunction(); DROP FUNCTION exec(text); DROP ROLE regress_justforcomments; -- make sure custodian cleans up files +-- make sure snapshot files exist for custodian to clean up +SELECT count(*) > 0 FROM pg_ls_logicalsnapdir(); + ?column? +---------- + t +(1 row) + +-- make sure rewrite mapping files exist for custodian to clean up +SELECT count(*) > 0 FROM pg_ls_logicalmapdir(); + ?column? +---------- + t +(1 row) + CHECKPOINT; DO $$ DECLARE diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql index d66f70f837..c076809f37 100644 --- a/contrib/test_decoding/sql/rewrite.sql +++ b/contrib/test_decoding/sql/rewrite.sql @@ -107,6 +107,13 @@ DROP FUNCTION exec(text); DROP ROLE regress_justforcomments; -- make sure custodian cleans up files + +-- make sure snapshot files exist for custodian to clean up +SELECT count(*) > 0 FROM pg_ls_logicalsnapdir(); + +-- make sure rewrite mapping files exist for custodian to clean up +SELECT count(*) > 0 FROM pg_ls_logicalmapdir(); + CHECKPOINT; DO $$ DECLARE -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com