Hello, we ran out of disk space on our production primary database on the filesystem dedicated to WALs (one of our logical replicas died, so WALs accumulated). As expected, the primary instance shut down:
Primary instance: PostgreSQL 11.3 (Ubuntu 11.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0, 64-bit

2019-09-29 19:24:28 UTC 2241 5d36488f.8c1 1 0 [53100]:PANIC: could not write to file "pg_wal/xlogtemp.2241": No space left on device
2019-09-29 19:24:30 UTC 1706 5d364871.6aa 3387 0 [00000]:LOG: WAL writer process (PID 2241) was terminated by signal 6: Aborted
2019-09-29 19:24:30 UTC 1706 5d364871.6aa 3388 0 [00000]:LOG: terminating any other active server processes
2019-09-29 19:24:30 UTC 1706 5d364871.6aa 3390 0 [00000]:LOG: all server processes terminated; reinitializing
2019-09-29 19:24:30 UTC 21499 5d9104ee.53fb 1 0 [00000]:LOG: database system was interrupted; last known up at 2019-09-29 19:23:47 UTC
2019-09-29 19:24:33 UTC 21499 5d9104ee.53fb 2 0 [00000]:LOG: recovered replication state of node 1 to CF3/442D0758
2019-09-29 19:24:33 UTC 21499 5d9104ee.53fb 3 0 [00000]:LOG: database system was not properly shut down; automatic recovery in progress
2019-09-29 19:24:33 UTC 21499 5d9104ee.53fb 4 0 [00000]:LOG: redo starts at FE9/A49830D8
2019-09-29 19:25:55 UTC 21499 5d9104ee.53fb 5 0 [00000]:LOG: redo done at FED/38FFC540
2019-09-29 19:25:55 UTC 21499 5d9104ee.53fb 6 0 [00000]:LOG: last completed transaction was at log time 2019-09-29 19:22:06.597333+00
2019-09-29 19:25:55 UTC 21499 5d9104ee.53fb 7 0 [00000]:LOG: checkpoint starting: end-of-recovery immediate
2019-09-29 19:26:59 UTC 21499 5d9104ee.53fb 8 0 [00000]:LOG: checkpoint complete: wrote 1046940 buffers (99.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=64.166 s, sync=0.008 s, total=64.212 s; sync files=184, longest=0.001 s, average=0.000 s; distance=15014380 kB, estimate=15014380 kB
2019-09-29 19:26:59 UTC 21499 5d9104ee.53fb 9 0 [53100]:FATAL: could not write to file "pg_wal/xlogtemp.21499": No space left on device
2019-09-29 19:26:59 UTC 1706 5d364871.6aa 3391 0 [00000]:LOG: startup process (PID 21499) exited with exit code 1
2019-09-29 19:26:59 UTC 1706 5d364871.6aa 3392 0 [00000]:LOG: aborting startup due to startup process failure
2019-09-29 19:26:59 UTC 1706 5d364871.6aa 3393 0 [00000]:LOG: database system is shut down

Expected behavior: free space on the WAL filesystem was 14 MB, so not enough for the next WAL segment to be written (WAL segments are 16 MB by default).
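(As a side note on the monitoring improvement mentioned at the end of this report: a minimal early-warning check on the dedicated WAL filesystem could look like the sketch below; the mount point and threshold are placeholders, not our actual configuration.)

#!/bin/sh
# Warn when the dedicated WAL filesystem drops below a threshold (example values only).
WAL_MOUNT=/pgsql/pg_wal        # placeholder: mount point of the WAL filesystem
THRESHOLD_MB=2048              # placeholder: alert when less than ~2 GB remains

free_mb=$(df -Pm "$WAL_MOUNT" | awk 'NR==2 {print $4}')
if [ "$free_mb" -lt "$THRESHOLD_MB" ]; then
    echo "WARNING: only ${free_mb} MB free on ${WAL_MOUNT}"
    exit 1
fi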
Once the failed logical replica was back up and running, we extended the filesystem for WALs and started the primary instance:

2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 1 0 [00000]:LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 2 0 [00000]:LOG: pg_stat_kcache.linux_hz is set to 1000000
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 3 0 [00000]:LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 4 0 [00000]:LOG: listening on IPv6 address "::", port 5432
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 5 0 [00000]:LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2019-09-30 08:21:44 UTC 13637 5d91bb18.3545 1 0 [00000]:LOG: database system was shut down at 2019-09-29 19:26:59 UTC
2019-09-30 08:21:44 UTC 13637 5d91bb18.3545 2 0 [00000]:LOG: recovered replication state of node 1 to CF3/442D0758
2019-09-30 08:21:44 UTC 13646 5d91bb18.354e 1 0 [00000]:LOG: pg_partman master background worker master process initialized with role postgres
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 6 0 [00000]:LOG: database system is ready to accept connections
2019-09-30 08:21:44 UTC 13645 5d91bb18.354d 1 0 2/0 [00000]:LOG: pg_cron scheduler started
2019-09-30 08:21:44 UTC 13647 5d91bb18.354f 1 0 3/0 [00000]:LOG: POWA connected to database powa

So the primary instance started, all logical replication streams connected, and as soon as logical replication for the failed subscriber processed its WALs, the no-longer-needed WALs were deleted as expected. Up to this point everything worked like a charm from the primary instance's point of view.

From this primary instance we run a WAL-shipping replica (pgBackRest is used), and this is where the issues started.

WAL recovery replica: PostgreSQL 11.3 (Ubuntu 11.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0, 64-bit

WAL files were restored from the backup up to the last one archived before the primary instance shut down:

2019-09-29 19:24:31.390 P00 INFO: found 0000000100000FED00000037 in the archive
2019-09-29 19:24:31.391 P00 INFO: archive-get command end: completed successfully (43ms)
2019-09-29 19:24:31 UTC 3062 5d76145f.bf6 181012 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000037" from archive
2019-09-29 19:24:31.919 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000038, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-29 19:24:31.958 P00 INFO: found 0000000100000FED00000038 in the archive
2019-09-29 19:24:31.959 P00 INFO: archive-get command end: completed successfully (41ms)
2019-09-29 19:24:31 UTC 3062 5d76145f.bf6 181013 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000038" from archive
2019-09-29 19:24:32.505 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-29 19:24:32.505 P00 INFO: unable to find 0000000100000FED00000039 in the archive
2019-09-29 19:24:32.505 P00 INFO: archive-get command end: completed successfully (2ms)
2019-09-29 19:24:34.247 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-29 19:24:34.448 P00 INFO: unable to find 0000000100000FED00000039 in the archive
2019-09-29 19:24:34.448 P00 INFO: archive-get command end: completed successfully (204ms)

So the replica instance was painfully waiting, and waiting, and waiting for the next WAL segment: 0000000100000FED00000039.

And yes, after the long outage pgBackRest found and restored WAL 0000000100000FED00000039:

2019-09-30 08:22:03.673 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:03.874 P00 INFO: unable to find 0000000100000FED00000039 in the archive
2019-09-30 08:22:03.875 P00 INFO: archive-get command end: completed successfully (203ms)
2019-09-30 08:22:08.678 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:09.121 P00 INFO: found 0000000100000FED00000039 in the archive
2019-09-30 08:22:09.123 P00 INFO: archive-get command end: completed successfully (446ms)
2019-09-30 08:22:09 UTC 3062 5d76145f.bf6 181014 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000039" from archive
2019-09-30 08:22:09 UTC 3062 5d76145f.bf6 181015 0 1/0 [00000]:LOG: invalid contrecord length 1956 at FED/38FFE208
2019-09-30 08:22:13.683 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000038, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:14.132 P00 INFO: found 0000000100000FED00000038 in the archive
2019-09-30 08:22:14.135 P00 INFO: archive-get command end: completed successfully (453ms)
2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181016 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000038" from archive
2019-09-30 08:22:14.195 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:14.236 P00 INFO: found 0000000100000FED00000039 in the archive
2019-09-30 08:22:14.238 P00 INFO: archive-get command end: completed successfully (44ms)
2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181017 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000039" from archive
2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181018 0 1/0 [00000]:LOG: invalid contrecord length 1956 at FED/38FFE208

But recovery on the replica failed to process WAL file 0000000100000FED00000039, logging "invalid contrecord length 1956 at FED/38FFE208". Since WAL segment 0000000100000FED00000039 still exists on the primary instance, I copied it to the WAL replica, but it was identical to the file restored by pgBackRest, so I assume WAL file backup/restore was not the root cause of this issue.
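Would dumping the records around the failing LSN with pg_waldump be a reasonable way to dig deeper? Something along these lines is what I have in mind (the directory holding our saved copies of the two segments is just a placeholder):

# Inspect the records at the end of segment ...38 and the start of ...39,
# around the LSN reported in the error (FED/38FFE208).
pg_waldump --path=/path/to/saved_wal --timeline=1 --start=FED/38FF0000 \
    0000000100000FED00000038 0000000100000FED00000039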
restored log file "0000000100000FED00000038" from archive 2019-09-30 08:22:14.195 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint= s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod 2019-09-30 08:22:14.236 P00 INFO: found 0000000100000FED00000039 in the archive 2019-09-30 08:22:14.238 P00 INFO: archive-get command end: completed successfully (44ms) 2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181017 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000039" from archive 2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181018 0 1/0 [00000]:LOG: invalid contrecord length 1956 at FED/38FFE208 But recovery on replica failed to proceed WAL file 0000000100000FED00000039 with log message: " invalid contrecord length 1956 at FED/38FFE208". Since the WAL 0000000100000FED00000039 still exists on primary instance, I've copied it to WAL replica, but it was identical to the file restored by pgBackRest, so I assume from this, that WAL file backup/restore was not root cause fro this issue. Recovery conf on replica instance is simple: standby_mode = 'on' restore_command = 'pgbackrest --log-level-console=info --process-max=6 --stanza=anl_master_prod archive-get %f "%p"' During planned properly proceeded shutdown on primary instance, recovery on WAL replica resumes smoothly. And for sure, after restoring database from backup taken past the primary instance issue, WAL replica works perfectly, as usual. Except the obvious need to re-configure monitoring to warn on WAL free space sooner, to avoid such case, my expectation was, taht WAL replica will be able to resume recovery it was not. We have stored copy of 0000000100000FED0000003[89] WAL file for analyses, is there something what can help us to find the issu root cause and mitigate this issue? I've found BUG #15412,seems to be same issue, but remains open. Thanks for any hints how to recover from this issue or analyze it deeper to find root cause. Kind regards Ales Zeleny