On 16.11.2016 at 22:07, Gunnar "Nick" Bluth wrote:
> On 16.11.2016 at 15:36, Stephen Frost wrote:
>> Gunnar, all,
>>
>> * Gunnar "Nick" Bluth (gunnar.bluth.ext...@elster.de) wrote:
>>> On 16.11.2016 at 11:37, Gunnar "Nick" Bluth wrote:
>>>> I ran into this issue (see patch) a few times over the past years, and
>>>> tend to forget it again (sigh!). Today I had to clean up a few hundred
>>>> GB of unarchived WALs, so I decided to write a patch for the
>>>> documentation this time.
>>>
>>> Uhm, well, the actual problem was a stale replication slot... and
>>> tomatoes on my eyes, it seems ;-/. Ashes etc.!
>>>
>>> However, I still think a warning on (esp. rsync's) RCs >= 128 is worth
>>> considering (see -v2 attached).
>>
>> Frankly, I wouldn't suggest including such wording as it would imply
>> that using a bare rsync command is an acceptable configuration of
>> archive_command. It isn't. At the very least, a bare rsync does
>> nothing to ensure that the WAL has been fsync'd to permanent storage
>> before returning, leading to potential data loss due to the WAL
>> segment being removed by PG before the new segment has been permanently
>> stored.
>
> I for myself deem a UPS-backed server in a different DC a pretty good
> starting point, and I reckon many others do as well... obviously it's
> not a belt and braces solution, but my guess would be that >90% of
> users have something similar in place, many of them actually using rsync
> (or scp) one way or the other (or, think WAL-E et al., how do you force
> an fsync on AWS?!?).
> In environments where there's a risk of the WAL segment being
> overwritten before the target server has fsync'd, heck, yeah, you're
> right. But then you'd probably have something quite sophisticated in
> place, and hate to see allegedly random _FATAL_ errors that are _not
> documented outside the source code_ even more. Esp. when you can't tell
> for sure (from the docs) if archiving your WAL segment will be retried
> or not.
>
>> The PG documentation around archive command is, at best, a starting
>> point for individuals who wish to implement their own proper backup
>> solution, not as examples of good practice for production environments.
>
> True. Which doesn't mean there's no room for more hints, like "ok, we
> throw a FATAL error sometimes, but they're not really a problem, you
> know, it's just external software that basically everyone uses at one
> point or the other doing odd things sometimes" ;-).
>
> Alas, I've been hunting a red herring today, cause when you find your
> pg_xlog cluttered with old files _and_ see FATAL archiving messages in
> your logs, your first thought is not "there's prolly a replication slot
> left over", but "uh oh, those archive_command calls failed, so something
> might be somehow stuck now".
>
> I'll try to come up with something more comprehensive, taking your
> comments into account...
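To illustrate Stephen's fsync point, here is a rough sketch (not a vetted
archive_command; the host name, paths and script name are made up) of a
wrapper that at least forces the copied segment out to disk on the receiving
side before reporting success back to PG:

#!/bin/sh
# Sketch only: copy a WAL segment to an archive host and ask that host to
# flush it to stable storage before we report success to PostgreSQL.
# "archive.example.com" and /var/lib/pgarchive are placeholders.

SRC="$1"                  # %p - path of the segment on the primary
FNAME="$2"                # %f - file name only
HOST="archive.example.com"
DIR="/var/lib/pgarchive"

rsync "$SRC" "$HOST:$DIR/$FNAME" || exit 1

# GNU coreutils' "sync -f <file>" flushes the filesystem containing <file>,
# which also covers the freshly created directory entry.
ssh "$HOST" "sync -f $DIR/$FNAME" || exit 1

You would wire that up as archive_command = 'push_wal.sh %p %f' (script name
made up); it assumes a reasonably recent GNU coreutils on the archive host,
and as a side effect any failure is reported back as a plain exit 1.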
So, attached is what I came up with. It's obviously not "complete"; however,
it points out the RC >= 128 "quirk" and also mentions Stephen's remarks on
rsync (although to get actual _data loss_, you'd have to have a power outage
in the DC caused by your PG server exploding... ;-).

Cheers,
-- 
Gunnar "Nick" Bluth
RHCE/SCLA

Mobile +49 172 8853339
Email: gunnar.bl...@pro-open.de

_____________________________________________________________
In 1984 mainstream users were choosing VMS over UNIX.
Ten years later they are choosing Windows over UNIX.
What part of that message aren't you getting? - Tom Payne
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 6eaed1e..09f0787 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -587,7 +587,8 @@ tar -cf backup.tar /usr/local/pgsql/data
     the administrator specify a shell command to be executed to copy a
     completed segment file to wherever it needs to go. The command could be
     as simple as a <literal>cp</>, or it could invoke a complex shell
-    script — it's all up to you.
+    script — it's all up to you. There are, however, some things to consider
+    when creating such a command, most of which are covered below.
    </para>
 
    <para>
@@ -636,7 +637,11 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
     <productname>PostgreSQL</> will assume that the file has been
     successfully archived, and will remove or recycle it. However, a nonzero
     status tells <productname>PostgreSQL</> that the file was not archived;
-    it will try again periodically until it succeeds.
+    it will try again periodically until it succeeds. Note that an exit
+    status of 128 or higher will cause the archiver process to exit (and
+    restart), resulting in a (potentially misleading) FATAL error in the
+    server log. For example, <command>rsync</> tends to return an exit
+    status of 255 on network issues.
    </para>
 
    <para>
@@ -696,6 +701,14 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
     preserve the file name (<literal>%f</>).
    </para>
 
+   <para>
+    Depending on your specific requirements (and datacenter layout), you may
+    want to make sure that the archived WAL segments have been written out to
+    persistent storage before the <command>archive_command</> returns.
+    Otherwise, a WAL segment that is assumed to be archived could be recycled
+    prematurely, rendering your archive incomplete in case of a power outage.
+   </para>
+
    <para>
     Note that although WAL archiving will allow you to restore any
     modifications made to the data in your <productname>PostgreSQL</> database,
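For completeness, a minimal sketch (script name and target made up) of the
kind of wrapper I have in mind for the exit-status quirk documented above: it
maps anything the transport reports at 128 or above (e.g. 255 from ssh on
connection problems) to a plain failure, so PostgreSQL simply retries the
segment instead of restarting the archiver with a FATAL message:

#!/bin/sh
# Sketch only: run the real copy command, but keep the status reported back
# to PostgreSQL below 128 so transport hiccups are retried as ordinary
# archiving failures. The rsync target is a placeholder.

rsync "$1" "archive.example.com:/var/lib/pgarchive/$2"
rc=$?

if [ "$rc" -ge 128 ]; then
    exit 1
fi
exit "$rc"

Called as something like archive_command = '/usr/local/bin/archive_wrapper.sh %p %f'
(again, the path is made up).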