On 06.10.2012 15:58, Amit Kapila wrote:
One more test seems to have failed. Apart from this, the other tests passed.

2. a. Master M-1
    b. Standby S-1 follows M-1
    c. insert 10 records on M-1. verify all records are visible on M-1,S-1
    d. Stop S-1
    e. insert 2 records on M-1.
    f. Stop M-1
    g. Start S-1
    h. Promote S-1
    i. Make M-1 recovery.conf such that it should connect to S-1
    j. Start M-1. Below error comes on M-1 which is expected as M-1 has more
data.
       LOG:  database system was shut down at 2012-10-05 16:45:39 IST
       LOG:  entering standby mode
       LOG:  consistent recovery state reached at 0/176A070
       LOG:  record with zero length at 0/176A070
       LOG:  database system is ready to accept read only connections
       LOG:  streaming replication successfully connected to primary
       LOG:  fetching timeline history file for timeline 2 from primary
server
       LOG:  replication terminated by primary server
       DETAIL:  End of WAL reached on timeline 1
       LOG:  walreceiver ended streaming and awaits new instructions
       LOG:  new timeline 2 forked off current database system timeline 1
before current recovery point 0/176A070
       LOG:  re-handshaking at position 0/1000000 on tli 1
       LOG:  replication terminated by primary server
       DETAIL:  End of WAL reached on timeline 1
       LOG:  walreceiver ended streaming and awaits new instructions
       LOG:  new timeline 2 forked off current database system timeline 1
before current recovery point 0/176A070
    k. Stop M-1. Start M-1. It is able to successfully connect to S-1 which
is a problem.
    l. check in S-1. Records inserted in step-e are not present.
    m. Now insert records in S-1. M-1 doesn't receive any records. On the M-1
server the following log is printed.
       LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
       LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
       LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
       LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0
       LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000001, offset 0

Hmm, it seems we need to keep track of which timeline we have used for recovery before. Before the restart, the master correctly notices that timeline 2 forked off earlier in its history, so it cannot recover to that timeline. But after the restart, the master begins recovery from the previous checkpoint, and because timeline 2 forked off timeline 1 after the checkpoint, it concludes that it can follow that timeline. It doesn't realize that it had already recovered/flushed some WAL in timeline 1 after the fork point.
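
To illustrate the failure mode described above, here is a toy model in Python (not PostgreSQL code). The function name can_follow_naive and the fork and checkpoint positions are hypothetical, chosen only so that the checkpoint lies before the fork point and the replayed position lies after it; only 0/176A070 is taken from the log above.

```python
def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' in hex (e.g. '0/176A070') to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def can_follow_naive(fork_lsn, known_position):
    """Naive rule: a new timeline can be followed if we are not yet past its fork point."""
    return known_position <= fork_lsn


FORK = parse_lsn("0/176A000")        # hypothetical fork point of timeline 2
CHECKPOINT = parse_lsn("0/1769000")  # hypothetical position of the last checkpoint
REPLAYED = parse_lsn("0/176A070")    # recovery point reported in the log

# Before the restart the server knows how far it has really replayed.
before_restart = can_follow_naive(FORK, REPLAYED)    # False: already past the fork

# After the restart it only looks at the checkpoint it starts from,
# so the WAL it flushed on timeline 1 after the fork is forgotten.
after_restart = can_follow_naive(FORK, CHECKPOINT)   # True: the wrong conclusion

print(before_restart, after_restart)
```

The two calls differ only in which position is remembered, which is why the decision changes across a restart unless the replayed position survives it.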

Attached is a new version of the patch. I committed the refactoring of XLogPageRead() already, as that was a readability improvement even without this patch. All the reported issues should be fixed now, although I will continue testing this tomorrow. I added various checks that the correct timeline is followed during recovery. minRecoveryPoint is now accompanied by a timeline ID, so that when we restart recovery, we check that we recover back to minRecoveryPoint along the same timeline as last time. Also, it now checks at the beginning of recovery that the checkpoint record comes from the correct timeline. That fixes the problem that you reported above. I also adjusted the error messages on timeline history problems to be clearer.
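
As a rough sketch of the two checks mentioned above, and not the code in the patch, the idea can be modelled in Python as follows. The helper names timeline_at and check_recovery_target, the list-of-tuples representation of a timeline history, and the fork and checkpoint positions are all assumptions made for illustration; which of the two checks fires in the reported scenario depends on the real values.

```python
def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' in hex to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def timeline_at(history, target_tli, lsn):
    """Return the timeline that owns `lsn` in the history of `target_tli`.

    `history` lists the ancestors of the target as (tli, switch_lsn) pairs in
    ascending order; each ancestor owns the WAL before its switch point.
    """
    for tli, switch_lsn in history:
        if lsn < switch_lsn:
            return tli
    return target_tli


def check_recovery_target(history, target_tli, checkpoint, min_recovery):
    """Return a list of problems; an empty list means the target may be followed.

    `checkpoint` and `min_recovery` are (lsn, tli) pairs remembered by the server.
    """
    problems = []
    for label, (lsn, tli) in (("checkpoint", checkpoint),
                              ("minRecoveryPoint", min_recovery)):
        expected = timeline_at(history, target_tli, lsn)
        if tli != expected:
            problems.append(
                f"{label} is on timeline {tli}, but timeline {target_tli} "
                f"expects timeline {expected} at that position")
    return problems


# Hypothetical values: timeline 2 forked off timeline 1 at 0/176A000.
history_of_tli2 = [(1, parse_lsn("0/176A000"))]
checkpoint = (parse_lsn("0/1769000"), 1)    # before the fork, on timeline 1
min_recovery = (parse_lsn("0/176A070"), 1)  # after the fork, still on timeline 1

problems = check_recovery_target(history_of_tli2, 2, checkpoint, min_recovery)
for line in problems:
    print(line)
```

With these values the checkpoint alone looks compatible with timeline 2, and only the remembered (position, timeline) pair reveals that WAL was already replayed on timeline 1 past the fork.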

- Heikki

Attachment: streaming-tli-switch-4.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)