On Sat, Feb 7, 2026 at 5:41 AM Nikolay Samokhvalov <[email protected]> wrote:
>
> Hi hackers,
>
> I observed a case when users who used "pgbackrest restore", not using
> "--type=standby", which means that pgBackRest placed recovery.signal,
> and since they wanted this node to be a standby, then manually placed
> standby.signal too, and configured primary_conninfo.
>
> Postgres allows both recovery.signal and standby.signal to coexist --
> no complaints, it starts, and gives standby.signal precedence.
>
> However, this might lead to a latent problem: imagine a standby gets
> promoted and then goes through a subsequent failover cycle. In this
> case, the orphaned recovery.signal causes the node to perform an
> unexpected PITR recovery and self-promote to a new timeline instead of
> remaining a standby. Which surprised the user a lot.
>
> Exact sequence that leads to trouble (Reproduced on PostgreSQL 17.7
> with pgBackRest 2.58.0):
>
> 1. Restore a backup (pgBackRest default creates `recovery.signal`)
> 2. Add `standby.signal` and `primary_conninfo` for streaming replication
> 3. Start as standby -- works fine (`standby.signal` takes precedence)
> 4. Promote this standby to primary (e.g., switchover) --
>    `standby.signal` is removed, `recovery.signal` is NOT
> 5. Node runs as primary with `recovery.signal` still on disk
> 6. Node crashes or is stopped
> 7. pg_rewind + add `standby.signal` to rejoin as standby
> 8. Start -- works as standby again, `recovery.signal` still present
> 9. Promote again (e.g., failback) -- `standby.signal` removed,
>    `recovery.signal` still NOT removed
> 10. If the node later needs to rejoin as standby via pg_rewind
>     (without `standby.signal` yet), it finds `recovery.signal`,
>     performs PITR recovery, and self-promotes to a new timeline
>
> I spent some time to understand this, and found in xlogrecovery.c:
>
>     if (stat(STANDBY_SIGNAL_FILE, &stat_buf) == 0)
>     {
>         /* ... */
>         standby_signal_file_found = true;
>     }
>     else if (stat(RECOVERY_SIGNAL_FILE, &stat_buf) == 0)
>     {
>         /* ... */
>         recovery_signal_file_found = true;
>     }
>
> -- so the recovery.signal is not registered, Postgres doesn't know it exists.
>
> Cleanup logic for both files in xlog.c looks independent:
>
>     if (endOfRecoveryInfo->standby_signal_file_found)
>         durable_unlink(STANDBY_SIGNAL_FILE, FATAL);
>
>     if (endOfRecoveryInfo->recovery_signal_file_found)
>         durable_unlink(RECOVERY_SIGNAL_FILE, FATAL);
>
> -- but it cleans up only what it knows. So, recovery.signal is not cleaned.
>
> Concerns/questions:
>
> 1. I don't like the fact that recovery_signal_file_found is set to
>    false although the file is present -- this is hard to read and
>    troubleshoot...
> 2. The comment in xlog.c even says "Remove the signal files out of
>    the way, so that we don't accidentally re-enter archive recovery
>    mode in a subsequent crash" -- but `recovery.signal` escapes this
>    cleanup. Looks like what's happening was not expected by design,
>    is that a correct conclusion?
> 3. It seems to me that having both files coexist is always a
>    misconfiguration -- there is no use case where a node should be in
>    both PITR and standby mode. If it is so, maybe we should:
>    - at minimum, remove the orphaned recovery.signal when
>      standby.signal takes precedence (or at end of recovery)
>    - do not start if both files are present: consider it abnormal and
>      ask for explicit cleanup, so user (or tooling) could decide which
>      file needs to stay
>
> thoughts?

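To make the interaction between the two quoted snippets concrete, here is
a minimal standalone model of it. This is only a sketch, not PostgreSQL
source and not a proposed patch: the type and function names (SignalState,
detect_else_if, detect_independent, recovery_signal_survives_promotion,
signal_model_selfcheck) are hypothetical, and the on-disk files are
replaced by plain booleans. It assumes the detection and cleanup logic
behave exactly as in the excerpts quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Which signal files startup "noticed"; mirrors the two quoted flags. */
typedef struct SignalState
{
    bool standby_signal_file_found;
    bool recovery_signal_file_found;
} SignalState;

/*
 * Model of the quoted detection logic: because of the else-if,
 * recovery.signal is never looked at when standby.signal exists.
 */
SignalState
detect_else_if(bool standby_on_disk, bool recovery_on_disk)
{
    SignalState s = {false, false};

    if (standby_on_disk)
        s.standby_signal_file_found = true;
    else if (recovery_on_disk)
        s.recovery_signal_file_found = true;
    return s;
}

/*
 * Hypothetical alternative: record each file independently, so that
 * cleanup knows about both (standby mode would still take precedence).
 */
SignalState
detect_independent(bool standby_on_disk, bool recovery_on_disk)
{
    SignalState s = {standby_on_disk, recovery_on_disk};

    return s;
}

/*
 * Model of the quoted end-of-recovery cleanup: only a file whose flag
 * is set gets removed. Returns true if recovery.signal is still on
 * disk after promotion.
 */
bool
recovery_signal_survives_promotion(SignalState s, bool recovery_on_disk)
{
    if (s.recovery_signal_file_found)
        recovery_on_disk = false;   /* stands in for durable_unlink() */
    return recovery_on_disk;
}

/* Walks through the both-files case under each detection variant. */
void
signal_model_selfcheck(void)
{
    /* Both files present, else-if detection: recovery.signal is orphaned. */
    assert(recovery_signal_survives_promotion(detect_else_if(true, true), true));

    /* Both files present, independent detection: it is cleaned up. */
    assert(!recovery_signal_survives_promotion(detect_independent(true, true), true));
}
```

In this model the orphaned file appears only when both files are present
at startup; with recovery.signal alone, both variants remove it at the
end of recovery.
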
+1 on also cleaning up recovery.signal when both signal files are present.

The documentation states that standby.signal takes precedence if both
files exist, and this configuration is not described as unacceptable.
So, it doesn't seem ok to prevent the server from starting in this case.

Regards,

-- 
Fujii Masao
