Re: [HACKERS] Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

Michael Paquier Mon, 21 Jan 2013 16:07:27 -0800

On Fri, Jan 18, 2013 at 6:20 PM, Heikki Linnakangas <[email protected]> wrote:


> Hmm, so it's the same issue I thought I fixed yesterday. My patch only
> fixed it for the case that the timeline switch is in the first page of the
> segment. When it's not, you still get two calls for a WAL record, first one
> for the first page in the segment, to verify that, and then the page that
> actually contains the record. The first call leads XLogPageRead to think it
> needs to read from the old timeline.
>
> We didn't have this problem before the xlogreader refactoring because
> XLogPageRead() was always called with the RecPtr of the record, even when
> we actually read the segment header from the file first. We'll have to
> somehow get that same information, the RecPtr of the record we're actually
> interested in, to XLogPageRead(). We could add a new argument to the
> callback for that, or we could keep xlogreader.c as it is and pass it
> through from ReadRecord to XLogPageRead() in the private struct.
>
> An explicit argument to the callback is probably best. That's
> straightforward, and it might be useful for the callback to know the actual
> WAL position that xlogreader.c is interested in anyway. See attached.
>
Just to let you know that I am still getting the error even after commit
2ff6555.
With the same scenario:
1) Start a master with 2 slaves
2) Kill/Stop master
3) Promote slave 1, it switches to timeline 2
Log on slave 1:
LOG:  selected new timeline ID: 2
4) Reconnect slave 2 to slave 1; slave 2 remains stuck on timeline 1 even
though recovery_target_timeline is set to latest
Log on slave 1 at this moment:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: TIMELINE_HISTORY 2
DEBUG:  received replication command: START_REPLICATION 0/5000000 TIMELINE 1
Slave 1 receives a command to start replication on timeline 1, while it is
already on timeline 2.
Log on slave 2 at this moment:
LOG:  restarted WAL streaming at 0/5000000 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/5014200
DEBUG:  walreceiver ended streaming and awaits new instructions

The timeline history file is the same for both nodes:
$ cat 00000002.history
1    0/5014200    no recovery target specified
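
To make the numbers above easier to compare, here is a minimal sketch in
Python (not PostgreSQL code) that parses a history line and works out which
timeline a given WAL position would fall on. It assumes each line reads as
(parent timeline, switch point, reason) and that positions before the switch
point belong to the parent timeline; the function names are hypothetical.

```python
def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' (hex) into a single integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history(content):
    """Parse history file text into (parent_tli, switch_lsn, reason) tuples.

    Assumption: fields are whitespace-separated and the reason is free text.
    """
    entries = []
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 2)
        reason = fields[2] if len(fields) > 2 else ""
        entries.append((int(fields[0]), parse_lsn(fields[1]), reason))
    return entries


def timeline_of(lsn, entries, current_tli):
    """Return the timeline an LSN falls on, under the assumption above."""
    for tli, switch_lsn, _reason in entries:
        if lsn < switch_lsn:
            return tli
    return current_tli  # at or past every switch point


history = "1    0/5014200    no recovery target specified\n"
entries = parse_history(history)
start_tli = timeline_of(parse_lsn("0/5000000"), entries, current_tli=2)
```

Under that reading, 0/5000000 lies before the switch point 0/5014200, so
start_tli comes out as 1, which matches the START_REPLICATION command
logged on slave 1.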

I might be wrong, but shouldn't there also be an entry for timeline 2 in
this file?

Am I missing something?
-- 
Michael Paquier
http://michael.otacoo.com
