Hi,

one take-away from the GitLab post-mortem appears to be that, after their secondary lost replication, they were confused about what pg_basebackup was doing when they tried to rebuild it. It just sat there and did nothing (even with --verbose), so they assumed something was wrong with either the primary or the connection, and restarted it several times.

AFAICT, it turns out the checkpoint was being written on the master (they probably did not use -c fast), but this wasn't obvious to them:

"One of the engineers went to the secondary and wiped the data directory, then ran pg_basebackup. Unfortunately pg_basebackup would hang, producing no meaningful output, despite the --verbose option being set."

[...]

"Unfortunately this did not resolve the problem of pg_basebackup not starting replication immediately. One of the engineers decided to run it with strace to see what it was blocking on. strace showed that pg_basebackup was hanging in a poll call, but that did not provide any other meaningful information that might have explained why."

[...]

"It would later be revealed by another engineer (who wasn't around at the time) that this is normal behavior: pg_basebackup will wait for the primary to start sending over replication data and it will sit and wait silently until that time. Unfortunately this was not clearly documented in our engineering runbooks nor in the official pg_basebackup document."

ISTM that even with WAL streaming, nothing would be written on the client until the checkpoint is complete, as do_pg_start_backup() runs the checkpoint and only returns the starting WAL location afterwards.

The attached (untested) patch is meant to kick off a discussion on how to improve the situation. It is supposed to mention the checkpoint when --verbose is used, and it adds a paragraph about the checkpoint being run to the Notes section of the documentation.

Michael

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c9dd62c..a298e5c 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -660,6 +660,14 @@ PostgreSQL documentation
   <title>Notes</title>
 
   <para>
+   At the beginning of the backup, a checkpoint needs to be written on the
+   server the backup is taken from.  Especially if the option
+   <literal>--checkpoint=fast</literal> is not used, this can take some time
+   during which <application>pg_basebackup</application> will be idle on the
+   server it is running on.
+  </para>
+
+  <para>
    The backup will include all files in the data directory and tablespaces,
    including the configuration files and any additional files placed in the
    directory by third parties, except certain temporary files managed by
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index b6463fa..ae18c16 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1754,6 +1754,9 @@ BaseBackup(void)
 	if (maxrate > 0)
 		maxrate_clause = psprintf("MAX_RATE %u", maxrate);
 
+	if (verbose)
+		fprintf(stderr, "%s: initiating base backup, waiting for checkpoint to complete\n", progname);
+
 	basebkp =
 		psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
 				 escaped_label,
@@ -1771,6 +1774,9 @@ BaseBackup(void)
 		disconnect_and_exit(1);
 	}
 
+	if (verbose)
+		fprintf(stderr, "%s: checkpoint completed\n", progname);
+
 	/*
 	 * Get the starting xlog position
 	 */
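
As a possible refinement for discussion (not part of the patch above), the --verbose message could also say whether a fast or a spread checkpoint was requested, since that is what decides how long the client sits idle. Below is a minimal standalone sketch of that idea. The helper names checkpoint_wait_message() and build_base_backup_cmd() are hypothetical and do not exist in pg_basebackup.c, the message wording is only a suggestion, and the command is reduced to the LABEL and FAST options of BASE_BACKUP. It uses plain snprintf() rather than psprintf() so that it compiles without the PostgreSQL tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not in pg_basebackup.c): pick the --verbose message
 * to print before BASE_BACKUP is sent, depending on the checkpoint mode.
 */
static const char *
checkpoint_wait_message(bool fastcheckpoint)
{
	return fastcheckpoint
		? "initiating base backup, waiting for fast checkpoint to complete"
		: "initiating base backup, waiting for spread checkpoint to complete "
		  "(use --checkpoint=fast to request an immediate one)";
}

/*
 * Hypothetical helper: format a reduced BASE_BACKUP command into buf.
 * Only the LABEL and FAST options are shown, and the label is assumed to
 * be escaped already.  Returns true if the command fit into the buffer.
 */
static bool
build_base_backup_cmd(char *buf, size_t buflen,
					  const char *escaped_label, bool fastcheckpoint)
{
	int			n;

	assert(buf != NULL && escaped_label != NULL);

	n = snprintf(buf, buflen, "BASE_BACKUP LABEL '%s'%s",
				 escaped_label, fastcheckpoint ? " FAST" : "");

	/* snprintf reports the length it wanted; reject truncated output */
	return n >= 0 && (size_t) n < buflen;
}
```

With something along these lines, the message printed before the command is sent would at least hint at -c fast when the user did not ask for it.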