On 16.11.2016 at 22:07, Gunnar "Nick" Bluth wrote:
> On 16.11.2016 at 15:36, Stephen Frost wrote:
>> Gunnar, all,
>>
>> * Gunnar "Nick" Bluth (gunnar.bluth.ext...@elster.de) wrote:
>>> On 16.11.2016 at 11:37, Gunnar "Nick" Bluth wrote:
>>>> I ran into this issue (see patch) a few times over the past years, and
>>>> tend to forget it again (sigh!). Today I had to clean up a few hundred
>>>> GB of unarchived WALs, so I decided to write a patch for the
>>>> documentation this time.
>>>
>>> Uhm, well, the actual problem was a stale replication slot... and
>>> tomatoes on my eyes, it seems ;-/. Ashes etc.!
>>>
>>> However, I still think a warning on (esp. rsync's) RCs >= 128 is worth
>>> considering (see -v2 attached).
>>
>> Frankly, I wouldn't suggest including such wording as it would imply
>> that using a bare rsync command is an acceptable configuration of
>> archive_command. It isn't. At the very least, a bare rsync does
>> nothing to ensure that the WAL has been fsync'd to permanent storage
>> before returning, leading to potential data loss due to the WAL
>> segment being removed by PG before the new segment has been permanently
>> stored.
>
> I for myself deem a UPS-backed server in a different DC a pretty good
> starting point, and I reckon many others do as well... obviously it's
> not a belt and braces solution, but my guess would be that >90% of
> users have something similar in place, many of them actually using rsync
> (or scp) one way or the other (or, think WAL-E et al., how do you force
> an fsync on AWS?!?).
> In environments where there's a risk of the WAL segment being
> overwritten before the target server has fsync'd, heck, yeah, you're
> right. But then you'd probably have something quite sophisticated in
> place, and hate to see allegedly random _FATAL_ errors that are _not
> documented outside the source code_ even more. Esp. when you can't tell
> for sure (from the docs) if archiving your WAL segment will be retried
> or not.
>
>> The PG documentation around archive command is, at best, a starting
>> point for individuals who wish to implement their own proper backup
>> solution, not as examples of good practice for production environments.
>
> True. Which doesn't mean there's no room for more hints, like "ok, we
> throw a FATAL error sometimes, but they're not really a problem, you
> know, it's just external software that basically everyone uses at one
> point or the other doing odd things sometimes" ;-).
>
> Alas, I've been hunting a red herring today, cause when you find your
> pg_xlog cluttered with old files _and_ see FATAL archiving messages in
> your logs, your first thought is not "there's prolly a replication slot
> left over", but "uh oh, those archive_command calls failed, so something
> might be somehow stuck now".
>
> I'll try to come up with something more comprehensive, taking your
> comments into account...
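To illustrate Stephen's fsync point, here is a rough sketch (not a vetted
archive_command; the host name, paths and script name are made up) of a
wrapper that at least forces the copied segment out to disk on the receiving
side before reporting success back to PG:

#!/bin/sh
# Sketch only: copy a WAL segment to an archive host and ask that host to
# flush it to stable storage before we report success to PostgreSQL.
# "archive.example.com" and /var/lib/pgarchive are placeholders.

SRC="$1"                  # %p - path of the segment on the primary
FNAME="$2"                # %f - file name only
HOST="archive.example.com"
DIR="/var/lib/pgarchive"

rsync "$SRC" "$HOST:$DIR/$FNAME" || exit 1

# GNU coreutils' "sync -f <file>" flushes the filesystem containing <file>,
# which also covers the freshly created directory entry.
ssh "$HOST" "sync -f $DIR/$FNAME" || exit 1

You would wire that up as archive_command = 'push_wal.sh %p %f' (script name
made up); it assumes a reasonably recent GNU coreutils on the archive host,
and as a side effect any failure is reported back as a plain exit 1.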
So, attached is what I came up with. It's obviously not "complete"; however,
it points out the RC >= 128 "quirk" and also mentions Stephen's remarks on
rsync (although to get actual _data loss_, you'd have to have a power outage
in the DC caused by your PG server exploding... ;-).

Cheers,
-- 
Gunnar "Nick" Bluth
RHCE/SCLA

Mobile +49 172 8853339
Email: gunnar.bl...@pro-open.de

_____________________________________________________________
In 1984 mainstream users were choosing VMS over UNIX.
Ten years later they are choosing Windows over UNIX.
What part of that message aren't you getting? - Tom Payne
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 6eaed1e..09f0787 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -587,7 +587,8 @@ tar -cf backup.tar /usr/local/pgsql/data
     the administrator specify a shell command to be executed to copy a
     completed segment file to wherever it needs to go. The command could be
     as simple as a <literal>cp</>, or it could invoke a complex shell
-    script — it's all up to you.
+    script — it's all up to you. There are, however, some things to consider
+    when creating such a command, most of which are covered below.
    </para>
 
    <para>
@@ -636,7 +637,11 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
     <productname>PostgreSQL</> will assume that the file has been
     successfully archived, and will remove or recycle it. However, a nonzero
     status tells <productname>PostgreSQL</> that the file was not archived;
-    it will try again periodically until it succeeds.
+    it will try again periodically until it succeeds. Note that an exit
+    status of 128 or higher will cause the archiver process to exit (and
+    restart), resulting in a (potentially misleading) FATAL error in the
+    server log. For example, <command>rsync</> tends to return an exit
+    status of 255 on network issues.
    </para>
 
    <para>
@@ -696,6 +701,14 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
     preserve the file name (<literal>%f</>).
    </para>
 
+   <para>
+    Depending on your specific requirements (and datacenter layout), you may
+    want to make sure that the archived WAL segments have been written out to
+    persistent storage before the <command>archive_command</> returns.
+    Otherwise, a WAL segment that is assumed to be archived could be recycled
+    prematurely, rendering your archive incomplete in case of a power outage.
+   </para>
+
    <para>
     Note that although WAL archiving will allow you to restore any
     modifications made to the data in your <productname>PostgreSQL</> database,
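For completeness, a minimal sketch (script name and target made up) of the
kind of wrapper I have in mind for the exit-status quirk documented above: it
maps anything the transport reports at 128 or above (e.g. 255 from ssh on
connection problems) to a plain failure, so PostgreSQL simply retries the
segment instead of restarting the archiver with a FATAL message:

#!/bin/sh
# Sketch only: run the real copy command, but keep the status reported back
# to PostgreSQL below 128 so transport hiccups are retried as ordinary
# archiving failures. The rsync target is a placeholder.

rsync "$1" "archive.example.com:/var/lib/pgarchive/$2"
rc=$?

if [ "$rc" -ge 128 ]; then
    exit 1
fi
exit "$rc"

Called as something like archive_command = '/usr/local/bin/archive_wrapper.sh %p %f'
(again, the path is made up).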