Re: [COMMITTERS] pgsql: Fast promote mode skips checkpoint at end of recovery.

Fujii Masao Tue, 29 Jan 2013 08:38:41 -0800

On Wed, Jan 30, 2013 at 1:27 AM, Fujii Masao <[email protected]> wrote:
> On Tue, Jan 29, 2013 at 9:07 AM, Simon Riggs <[email protected]> wrote:
>> Fast promote mode skips checkpoint at end of recovery.
>> pg_ctl promote -m fast will skip the checkpoint at end of recovery so that we
>> can achieve very fast failover when the apply delay is low. Write new WAL record
>> XLOG_END_OF_RECOVERY to allow us to switch timeline correctly for downstream log
>> readers. If we skip synchronous end of recovery checkpoint we request a normal
>> spread checkpoint so that the window of re-recovery is low.
>
> When I tested this feature, I encountered the following FATAL message.
>
>     FATAL:  highest timeline 1 of the primary is behind recovery timeline 2
>
> Is this intentional behavior or a bug? What I did in my test is:
>
> 1. Set up one master (A), one standby (B), one cascade standby (C)
> 2. After running pgbench -i -s 10, I promoted the standby (B) with fast mode
> 3. Then, after (B) had come up as the new master but before the
>     end-of-recovery checkpoint had completed, I shut down (B) with
>     immediate mode.
> 4. Restart the server (B).
> 5. After the standby (C) established the replication connection with (B),
>     I got the above FATAL messages repeatedly.
>
> Promoting (B) increments the timeline ID to 2 and generates the timeline
> history file. But after restarting (B), its timeline ID is unexpectedly
> reset to 1. This seems to be the cause of the problem.
>
> To address this problem, should we switch to the new timeline ID whenever
> we read an XLOG_END_OF_RECOVERY record, even during crash recovery?
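
To make the question above concrete, here is a toy model of a replay loop.
It is only a sketch and not PostgreSQL's recovery code: WalRecord, replay
and the switch_on_end_of_recovery flag are hypothetical names made up for
illustration. It only shows how the timeline a restarted server ends up on
depends on whether replay honours the XLOG_END_OF_RECOVERY record.

```python
from dataclasses import dataclass

END_OF_RECOVERY = "XLOG_END_OF_RECOVERY"


@dataclass
class WalRecord:
    kind: str              # e.g. "INSERT" or END_OF_RECOVERY
    new_timeline: int = 0  # only meaningful for END_OF_RECOVERY records


def replay(records, start_timeline, switch_on_end_of_recovery):
    """Return the timeline a toy replay loop ends on.

    start_timeline stands in for the timeline recorded by the last
    completed checkpoint (1 in the scenario described in this thread).
    """
    timeline = start_timeline
    for rec in records:
        if rec.kind == END_OF_RECOVERY and switch_on_end_of_recovery:
            # Proposed behaviour: always follow the timeline switch.
            timeline = rec.new_timeline
    return timeline


# WAL as it might look after a fast promotion followed by one write.
wal = [
    WalRecord("INSERT"),
    WalRecord(END_OF_RECOVERY, new_timeline=2),
    WalRecord("INSERT"),
]

print(replay(wal, 1, switch_on_end_of_recovery=False))  # 1
print(replay(wal, 1, switch_on_end_of_recovery=True))   # 2
```

In this model, ignoring the record leaves the server on timeline 1, which
matches the symptom reported above.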


On second thought, we don't need such a complicated test case. A simpler
procedure produces a problem that derives from the same cause as the one
reported above:

1. Set up one master (A) and one standby (B)
2. Promote (B) with fast mode after running pgbench -i -s 10
3. Execute a write transaction on the new master (B)
4. Shut down (B) with immediate mode before the end-of-recovery checkpoint
   has completed
5. Restart (B)

Then you can confirm that the write transaction executed in step 3 has
been lost.
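
As a rough aid for anyone repeating this, the sketch below builds the
command lines for steps 2 to 5 without running them. The data directory,
ports, database name and the fast_promote_test table are assumptions for
illustration, and the "promote -m fast" spelling is taken from the commit
message quoted above, so it may not exist in other versions. The timing in
step 4 still has to be handled by hand: the immediate shutdown must happen
before the end-of-recovery checkpoint completes.

```python
import shlex


def repro_commands(standby_dir, primary_port, standby_port, dbname="postgres"):
    """Build (but do not execute) the commands for steps 2-5."""
    return [
        # Step 2a: load pgbench data through the master (A).
        ["pgbench", "-i", "-s", "10", "-p", str(primary_port), dbname],
        # Step 2b: promote the standby (B) with fast mode.
        ["pg_ctl", "-D", standby_dir, "promote", "-m", "fast"],
        # Step 3: a write transaction on the new master (B).
        ["psql", "-p", str(standby_port), "-c",
         "CREATE TABLE fast_promote_test (id int)", dbname],
        # Step 4: immediate shutdown of (B).
        ["pg_ctl", "-D", standby_dir, "stop", "-m", "immediate"],
        # Step 5: restart (B).
        ["pg_ctl", "-D", standby_dir, "start"],
    ]


# Hypothetical paths and ports; adjust to the local setup.
commands = repro_commands("/path/to/standby_b", 5432, 5433)
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

After the restart, checking whether fast_promote_test still exists on (B)
is one way to see whether the write from step 3 survived.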

Regards,

-- 
Fujii Masao


