On 2022/02/02 23:46, Bharath Rupireddy wrote:
On Tue, Feb 1, 2022 at 9:39 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
I found that CreateRestartPoint() already reported the redo lsn as follows 
after emitting the restartpoint log message. To avoid duplicated logging of the 
same information, we should update this code?

         ereport((log_checkpoints ? LOG : DEBUG2),
                         (errmsg("recovery restart point at %X/%X",
                                         LSN_FORMAT_ARGS(lastCheckPoint.redo)),
                          xtime ? errdetail("Last completed transaction was at log 
time %s.",
                                                            
timestamptz_to_str(xtime)) : 0));

This code reports lastCheckPoint.redo as redo lsn. OTOH, with the patch, 
LogCheckpointEnd() reports ControlFile->checkPointCopy.redo. They may be 
different, for example, when the current DB state is not DB_IN_ARCHIVE_RECOVERY. 
In this case, which lsn should we report as redo lsn?

Do we ever reach CreateRestartPoint when ControlFile->stat !=
DB_IN_ARCHIVE_RECOVERY? Assert(ControlFile->state ==
DB_IN_ARCHIVE_RECOVERY); in CreateRestartPoint doesn't fail any
regression tests.

ISTM that CreateRestartPoint() can reach the condition ControlFile->state != 
DB_IN_ARCHIVE_RECOVERY. Please imagine the case where CreateRestartPoint() has 
already started and calls CheckPointGuts(). If the standby server is promoted 
while CreateRestartPoint() is flushing the data to disk at CheckPointGuts(), the 
state would be updated to DB_IN_PRODUCTION and CreateRestartPoint() can see the 
state != DB_IN_ARCHIVE_RECOVERY later.

As far as I read the code, this case seems to be able to make the server unrecoverable. 
If this case happens, since pg_control is not updated, pg_control still indicates the 
REDO LSN of last valid restartpoint. But CreateRestartPoint() seems to delete old WAL 
files based on its "current" REDO LSN not pg_control's REDO LSN. That is, WAL 
files required for the crash recovery starting from pg_control's REDO LSN would be 
removed.

If this understanding is right, to address this issue, probably we need to make 
CreateRestartPoint() do nothing (return immediately) when the state != 
DB_IN_ARCHIVE_RECOVERY?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply via email to