The standby log:
-> 2010-11-14 17:40:16 MST - 887 -LOG: database system was shut down in
recovery at 2010-11-14 17:40:10 MST
-> 2010-11-14 17:40:16 MST - 887 -LOG: entering standby mode
-> 2010-11-14 17:40:16 MST - 887 -LOG: consistent recovery state reached at
3/3988FF8
-> 2010-11-14 17:40:16 MST - 887 -LOG: redo starts at 3/3988F68
-> 2010-11-14 17:40:16 MST - 887 -LOG: invalid record length at 3/3988FF8
-> 2010-11-14 17:40:16 MST - 885 -LOG: database system is ready to accept
read only connections
-> 2010-11-14 17:40:16 MST - 890 -LOG: streaming replication successfully
connected to primary
-> 2010-11-15 02:24:26 MST - 890 -FATAL: could not receive data from WAL
stream: FATAL: requested WAL segment 000000010000000300000004 has already been
removed
-> 2010-11-15 02:24:26 MST - 887 -LOG: unexpected pageaddr 2/B9BF2000 in log
file 3, segment 4, offset 12525568
-> 2010-11-15 02:24:27 MST - 2790 -LOG: streaming replication successfully
connected to primary
-> 2010-11-15 02:24:27 MST - 2790 -FATAL: could not receive data from WAL
stream: FATAL: requested WAL segment 000000010000000300000004 has already been
removed
-> 2010-11-15 02:24:32 MST - 2791 -LOG: streaming replication successfully
connected to primary
-> 2010-11-15 02:24:32 MST - 2791 -FATAL: could not receive data from WAL
stream: FATAL: requested WAL segment 000000010000000300000004 has already been
removed
...
Now, the standby is geographically isolated from the master and connects over
the internet, so it's no shock that, with a large enough update and
wal_keep_segments set too small, the speed of disk on the master could outrun
the speed of the network enough for this to happen.
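As a rough back-of-the-envelope (assuming the default 16 MB WAL segment size,
which is compiled in unless changed at build time), wal_keep_segments = 64 only
guarantees about 1 GB of WAL retained on the master:

```python
# Rough estimate of how much WAL wal_keep_segments guarantees on the master.
# Assumes the stock 16 MB segment size (a build-time option).
SEGMENT_SIZE_MB = 16
wal_keep_segments = 64

retained_mb = wal_keep_segments * SEGMENT_SIZE_MB
print(f"Guaranteed retained WAL: {retained_mb} MB")  # 1024 MB
```

A sufficiently large burst of write activity can blow past that before a
standby on a slow link catches up.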
But as far as I know there was almost no write activity at 2am: no active users
at all, no batch processing. There is a pg_dumpall that kicks off at 2am, and
these errors start at about the time it finished. I also did the original sync
and standby launch immediately after a mass update, before autovacuum had a
chance to run, so at some point there would have been a lot of tuples marked
dead.
wal_keep_segments was at 64, and the first segment still on disk was
000000010000000300000010; checkpoint_segments was 16. In the midst of the
pg_dumpall, the master logs do show messages about checkpoints occurring too
frequently. The 70-ish log segments still on disk show mod times clustered
right around 2:23, advancing a second or so each, whereas they were created
over a much longer period going back to the day before.
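For reference, a sketch of the relevant master settings as described above; the
larger wal_keep_segments value shown commented out is only an illustrative
mitigation I haven't tested, and it just trades disk space for a bigger
catch-up window:

```
# settings in effect when the failure occurred
wal_keep_segments = 64       # ~1 GB of WAL guaranteed (16 MB segments)
checkpoint_segments = 16

# possible mitigation: retain more WAL so a slow standby can catch up
#wal_keep_segments = 256     # ~4 GB
```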
One question: what happened here? Why were log files that had been created the
day before all updated at once around 2:23?
One suggestion: would it be possible not to delete WAL segments that are still
needed by a currently attached standby?
--
Scott Ribe
[email protected]
http://www.elevated-dev.com/
(303) 722-0567 voice
--
Sent via pgsql-general mailing list ([email protected])