On 02/13/2014 01:37 PM, Kyotaro HORIGUCHI wrote:
At Thu, 13 Feb 2014 10:11:22 +0200, Heikki Linnakangas <hlinnakan...@vmware.com> 
wrote in <52fc7e2a.9060...@vmware.com>
On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:
Wouldn't it be better to not archive the old segment, and instead
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same

Ok, I think we're in agreement that that's the way to go for master.

Does this mean that no mechanical solution for this situation
will be given and an operator should remove the older segment for
the same segment id before starting recovoery process?

I don't understand. With the proposed change to never archive a partial segment, there is no segment with the same filename in the archive. There is nothing to remove. Or did you mean that you have to do that in 9.0-9.2 back-branches, if we only back-patch this change to 9.3?

If it's already in the archive, it's not going to be removed from the

I have understood the conclusion so far is not archiving the
older segment when promotion but it seems a bit odd story as you
suggested. If the conclusion here is no aid as my new
understanding, would you let me hear the reason why recovery have
changed to prefer archive to pg_xlog?

The commit abf5c5c9a4 seems to change the behavior but I don't
find the reason for the change.

Well, it didn't work too well before abf5c5c9a4 either. With your test script, and commit abf5c5c9a4 reverted [1], you got this:

waiting for server to start....LOG: database system was interrupted; last known up at 2014-02-13 14:17:53 EET
LOG:  starting archive recovery
LOG:  restored log file "00000002.history" from archive
/bin/cp: cannot stat ‘/home/heikki/pgsql.93stable/test/arc/000000020000000000000002’: No such file or directory
LOG:  restored log file "000000010000000000000002" from archive
LOG:  record with zero length at 0/2000090
LOG:  invalid primary checkpoint record
/bin/cp: cannot stat ‘/home/heikki/pgsql.93stable/test/arc/000000020000000000000002’: No such file or directory
LOG:  restored log file "000000010000000000000002" from archive
LOG:  using previous checkpoint record at 0/2000028
LOG:  consistent recovery state reached at 0/2000028
LOG:  record with zero length at 0/2000090
LOG:  redo is not required
/bin/cp: cannot stat ‘/home/heikki/pgsql.93stable/test/arc/000000020000000000000002’: No such file or directory
LOG:  restored log file "000000010000000000000002" from archive
/bin/cp: cannot stat ‘/home/heikki/pgsql.93stable/test/arc/00000003.history’: No such file or directory
LOG:  selected new timeline ID: 3
LOG:  restored log file "00000002.history" from archive
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Note the "invalid primary checkpoint record" message. It restored the partial 000000010000000000000002 file from the archive, and tried to recover using that instead of the 000000020000000000000002 file from pg_xlog. The primary checkpoint record is not present in 000000010000000000000002, so it fell back to recover from the previous checkpoint.

It was the exact same issue, even though it appeared to work if you don't look carefully. The second recovery never replayed the WAL generated on timeline 2, it branched off directly from timeline 1, at the same point where timeline 2 did. You can see that by looking at the history files. Also, if you create a table and insert a row to it after the first recovery, just before killing the server, you'll see that the row is not there after the second recovery. That's wrong; the data files might already contain some changes from timeline 2, and you end up with a corrupt database. Thanks to commit abf5c5c9a4, the situation is at least detected, and the system refuses to start up.

The behavior where we prefer a segment from archive with lower TLI over a file with higher TLI in pg_xlog actually changed in commit a068c391ab0. Arguably changing it wasn't a good idea, but the problem your test script demonstrates can be fixed by not archiving the partial segment, with no change to the preference of archive/pg_xlog. As discussed, archiving a partial segment seems like a bad idea anyway, so let's just stop doing that.

[1] just applying commit abf5c5c9a4 in reverse doesn't apply cleanly due to further changes to the code, but I did this to simulate the old behavior:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a4a504..352d3f7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4366,6 +4366,7 @@ readRecoveryCommandFile(void)

        /* Enable fetching from archive recovery area */
        ArchiveRecoveryRequested = true;
+       InArchiveRecovery = true;

         * If user specified recovery_target_timeline, validate it or compute 

- Heikki

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to