On 2022/07/07 9:09, Michael Paquier wrote:
On Wed, Jul 06, 2022 at 11:27:58PM +0900, Fujii Masao wrote:
For the test, BASE_BACKUP needs to be canceled after it finishes
do_pg_backup_start(), i.e., checkpointing, and before it calls
do_pg_backup_stop(). So the timing to cancel that seems more severe
than the test added in 0475a97f. I'm afraid that some tests can
easily cancel the BASE_BACKUP while it's performing a checkpoint in
do_pg_backup_start(). So for now I'm thinking to avoid such an
unstable test.
Hmm. In order to make sure that the checkpoint of the base backup is
completed, and assuming that the checkpoint is fast while the base
backup has a max rate, you could rely on a query that does a
poll_query_until() on pg_control_checkpoint(), no? As long as you use
IPC::Run::start, pg_basebackup would be async so the polling query and
the cancellation can be done in parallel of it. 0475a97 did almost
that, except that it waits for the WAL sender to be started.
There seems to be some corner cases where we cannot rely on that.
If "spread" checkpoint is already running when BASE_BACKUP is executed,
poll_query_until() may report the end of that "spread" checkpoint before BASE_BACKUP
internally starts its checkpoint. Which may cause the test to fail.
If BASE_BACKUP is accidentally canceled after poll_query_until() reports the
end of checkpoint but before do_pg_backup_start() finishes (i.e., before
entering the error cleanup block using do_pg_abort_backup callback), the test
may fail.
Probably we may be able to decrease the risk of those test failures by using
some techniques, e.g., adding the fixed wait time before requesting the cancel.
But I'm not sure if it's worth adding the test for the corner case issue that I
reported at the risk of adding the unstable test. The issue could happen only
when both BASE_BACKUP and low level API for backup are eecuted via logical
replication walsender mode, and BASE_BACKUP is canceled or terminated.
But if many think that it's worth adding the test, I will give a try. But even
in that case, I think it's better to commit the proposed patch at first to fix
the bug, and then to write the patch adding the test.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION