On Tue, Jun 29, 2010 at 10:21 AM, Kevin Grittner
<kevin.gritt...@wicourts.gov> wrote:
> Robert Haas <robertmh...@gmail.com> wrote:
>
>> ...with this patch, following the above, you get:
>>
>> FATAL: invalid record in WAL stream
>> HINT: Take a new base backup, or remove recovery.conf and restart
>> in read-write mode.
>> LOG: startup process (PID 6126) exited with exit code 1
>> LOG: terminating any other active server processes
>
> If someone is sloppy about how they copy the WAL files around, they
> could temporarily have a truncated file.
Can you explain the scenario you're concerned about in more detail?

> If we want to be tolerant
> of straight file copies, without a temporary name or location with a
> move on completion, we would need some kind of retry or timeout. It
> appears that you have this hard-coded to five retries. I'm not
> saying this is a bad setting, but I always wonder about hard-coded
> magic numbers like this. What's the delay between retries? How did
> you arrive at five as the magic number?

It's approximately the number Josh Berkus suggested in an email
upthread. In other words, SWAG.

There's not a fixed delay between retries - it represents the number
of times that we have either (a) streamed the relevant chunk from the
master via WALSender, or (b) retrieved the segment from the archive
with restore_command. The first retry CAN help, if WAL streaming from
master to standby was interrupted unexpectedly. AFAIK, the additional
retries after that are just paranoia, but I can't rule out the
possibility that I'm missing something, in which case we might have to
rethink the whole approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
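
For illustration only, the retry accounting described above (a bounded
number of re-fetches, with no sleep in between) might be sketched
roughly as follows. This is Python rather than the C of the actual
patch, and every name in it (read_record_with_refetch,
MAX_FETCH_ATTEMPTS, InvalidWalRecord, and the fetch and is_valid
callbacks) is hypothetical. It also assumes the limit of five counts
the fetches themselves, whether they come from streaming or from
restore_command.

```python
from typing import Callable, Tuple

# Hypothetical constant; the thread only says the limit is hard-coded to five.
MAX_FETCH_ATTEMPTS = 5


class InvalidWalRecord(Exception):
    """Raised when the record is still invalid after every allowed fetch."""


def read_record_with_refetch(
    fetch: Callable[[], bytes],
    is_valid: Callable[[bytes], bool],
    max_attempts: int = MAX_FETCH_ATTEMPTS,
) -> Tuple[bytes, int]:
    """Fetch a WAL chunk again each time validation fails.

    `fetch` stands in for either streaming the chunk from the master or
    retrieving the segment from the archive. There is deliberately no
    delay between attempts: the limit counts fetches, not elapsed time.
    Returns the valid data and the number of fetches it took.
    """
    for attempt in range(1, max_attempts + 1):
        data = fetch()
        if is_valid(data):
            return data, attempt
    # Every allowed fetch produced an invalid record: give up.
    raise InvalidWalRecord("invalid record in WAL stream")
```

With a source that hands back a truncated chunk once and a complete
one on the second fetch, the function returns after two attempts; a
source that never produces a valid record raises after the fifth.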