Thanks Bharath for your response,

> You mentioned that
> the connection to primary was lost, so you need to dive deep as to why
> it got lost. If the connection was lost half-way through fetching the
> WAL record, the standby may emit such a LOG message.

The connection was lost due to bad network. Currently we are okay with bad 
network, because applications in general has to always expect bad network in 
their design.

To simulate the bad network, we manually killed the wal-sender process in 
primary multiple times, this should be same as primary unable to send messages 
to standby due to bad network. And in these experiments, the standby is able to 
join the primary after checking WAL files in archive, within few seconds.

But, when the standby gets disconnected due to bad network, the standby is 
unable to join back, so we wanted to understand why this is happening.


> I understand this problem and there's a proposed patch to help with
> this - 
> https://www.postgresql.org/message-id/CALj2ACVryN_PdFmQkbhga1VeW10VgQ4Lv9JXO=3njkvzt8q...@mail.gmail.com.
> 
> It basically allows one to set a timeout as to how much duration the
> standby can restore from archive before switching to stream.
> Therefore, in your case, the standby doesn't have to wait for 1hr to
> connect to primary, but it can connect before that.

Good to know about this patch, because we tried to do something similar to this.

But this patch will not solve our problem, because here the hot-standby is NOT 
downloading WAL files from the archive for very long time and wasting time, 
but,  the standby simply switching the sources from STREAM to ARCHIVE in a 
tight loop doing nothing.

In fact, the standby is trying to connect to the primary for streaming back 
after it fails to pull the latest WAL file from archive. But trying to stream 
from the primary fails again immediately.


To explain this situation better, I am adding more logs.

Note: the logs mentioned here includes the custom logging we enable for 
debugging purposes.



2023-02-09 19:52:30.909 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Starting 
the state machine again from starting, changing current source to 
XLOG_FROM_ARCHIVE
2023-02-09 19:52:30.909 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched 
WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:31.616383 Archive '00000006000000020000001C' does not 
exist.
2023-02-09 19:52:31.618 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  
Successfully read the WAL file using XLogFileReadAnyTLI from archive or 
existing pg_wal, returning true and exit.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  last 
source archive failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Moving to 
XLOG_FROM_STREAM state and start walreceiver if necessary
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched 
WAL source from archive to stream after failure with WalStreamingPreferred false
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  changing 
the current file timeline to new tli.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  we have 
data from wal receiver.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  after 
streaming file, we dont have WAL file opened, so read now.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Due to WAL 
receiver, we have WAL file present and opened and hence returning true
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  invalid 
record length at 2/1C0324C8: wanted 24, got 0
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  last 
source stream failed, and now switching to new source.
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Failure 
while streaming, we might have found invalid record in WAL streamed from master
2023-02-09 19:52:31.619 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Stopping 
WAL receiver as we saw serious failure while streaming
INFO: 2023/02/09 19:52:31.796489 WAL: 00000007.history will be downloaded from 
archive
ERROR: 2023/02/09 19:52:32.330491 Archive '00000007.history' does not exist.
2023-02-09 19:52:32.333 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  We were 
requested to recover to latest timeline, but rescan is NOT needed.
2023-02-09 19:52:32.333 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  We decided 
to sleep before retry the state-machine
2023-02-09 19:52:35.910 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Starting 
the state machine again from starting, changing current source to 
XLOG_FROM_ARCHIVE
2023-02-09 19:52:35.910 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched 
WAL source from stream to archive after failure
ERROR: 2023/02/09 19:52:36.607613 Archive '00000006000000020000001C' does not 
exist.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  
Successfully read the WAL file using XLogFileReadAnyTLI from archive or 
existing pg_wal, returning true and exit.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  last 
source archive failed, and now switching to new source.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Moving to 
XLOG_FROM_STREAM state and start walreceiver if necessary
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  switched 
WAL source from archive to stream after failure with WalStreamingPreferred false
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  changing 
the current file timeline to new tli.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  we have 
data from wal receiver.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  after 
streaming file, we dont have WAL file opened, so read now.
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  Due to WAL 
receiver, we have WAL file present and opened and hence returning true
2023-02-09 19:52:36.610 GMT [  ] () [0 | 00000 | 63e4244f.54]: LOG:  invalid 
record length at 2/1C0324C8: wanted 24, got 0



So, we are wondering, why killing of wal-sender in primary abruptly seems okay 
but a few network failures are not okay for the standby to recover.

Thank you.

Best,
Harinath




Reply via email to