Hello,
For some unknown reason (probably a very big transaction at the source), we
experienced a logical decoding breakdown,
due to a timeout from the subscriber side (either wal_receiver_timeout or
connexion drop by network equipment due to inactivity).
The problem is, that due to that failure, the wal_receiver process stops. When
the wal_sender is ready to send some data, it finds the connexion broken and
exits.
A new wal_sender process is created that restarts from the beginning (restart
LSN). This is an endless loop.
Checking the network connexion between wal_sender and wal_receiver, we found
that no traffic occurs for hours.
We first increased wal_receiver_timeout up to 12h and still got a disconnection
on the receiver party:
2024-10-17 16:31:58.645 GMT [1356203:2] user=,db=,app=,client= ERROR:
terminating logical replication worker due to timeout
2024-10-17 16:31:58.648 GMT [849296:212] user=,db=,app=,client= LOG: background
worker "logical replication worker" (PID 1356203) exited with exit code 1
Then put this parameter to 0, but got then disconnected by the network (note
the slight difference in message):
2024-10-21 11:45:42.867 GMT [1697787:2] user=,db=,app=,client= ERROR: could not
receive data from WAL stream: could not receive data from server: Connection
timed out
2024-10-21 11:45:42.869 GMT [849296:40860] user=,db=,app=,client= LOG:
background worker "logical replication worker" (PID 1697787) exited with exit
code 1
The message is generated in libpqrcv_receive function
(replication/libpqwalreceiver/libpqwalreceiver.c) which calls pqsecure_raw_read
(interfaces/libpq/fe-secure.c)
The last message "Connection timed out" is the errno translation from the recv
system function:
ETIMEDOUT Connection timed out (POSIX.1-2001)
When those timeout occurred, the sender was still busy deleting files from
data/pg_replslot/bdcpb21_sene, accumulating more than 6 millions small ".spill"
files.
It seems this very long pause is at cleanup stage were PG is blindly trying to
delete those files.
strace on wal sender show tons of calls like:
unlink("pg_replslot/bdcpb21_sene/xid-2 721 821 917-lsn-439C-0.spill") = -1
ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-1000000.spill") = -1
ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-2000000.spill") = -1
ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-3000000.spill") = -1
ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-4000000.spill") = -1
ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-5000000.spill") = -1
ENOENT (Aucun fichier ou dossier de ce type)
This occurs in ReorderBufferRestoreCleanup
(backend/replication/logical/reorderbuffer.c).
The call stack presumes this may probably occur in DecodeCommit or DecodeAbort
(backend/replication/logical/decode.c):
unlink("pg_replslot/bdcpb21_sene/xid-2730444214-lsn-43A6-88000000.spill") = -1
ENOENT (Aucun fichier ou dossier de ce type)
> /usr/lib64/libc-2.17.so(unlink+0x7) [0xf12e7]
> /usr/pgsql-15/bin/postgres(ReorderBufferRestoreCleanup.isra.17+0x5d)
> [0x769e3d]
> /usr/pgsql-15/bin/postgres(ReorderBufferCleanupTXN+0x166) [0x76aec6] <===
> replication/logical/reorderbuff.c:1480 (mais cette fonction (static) n'est
> utiliée qu'au sein de ce module ...)
> /usr/pgsql-15/bin/postgres(xact_decode+0x1e7) [0x75f217] <===
> replication/logical/decode.c:175
> /usr/pgsql-15/bin/postgres(LogicalDecodingProcessRecord+0x73) [0x75eee3] <===
> replication/logical/decode.c:90, appelle la fonction rmgr.rm_decode(ctx,
> &buf) = 1 des 6 méthodes du resource manager
> /usr/pgsql-15/bin/postgres(XLogSendLogical+0x4e) [0x78294e]
> /usr/pgsql-15/bin/postgres(WalSndLoop+0x151) [0x785121]
> /usr/pgsql-15/bin/postgres(exec_replication_command+0xcba) [0x785f4a]
> /usr/pgsql-15/bin/postgres(PostgresMain+0xfa8) [0x7d0588]
> /usr/pgsql-15/bin/postgres(ServerLoop+0xa8a) [0x493b97]
> /usr/pgsql-15/bin/postgres(PostmasterMain+0xe6c) [0x74d66c]
> /usr/pgsql-15/bin/postgres(main+0x1c5) [0x494a05]
> /usr/lib64/libc-2.17.so(__libc_start_main+0xf4) [0x22554]
> /usr/pgsql-15/bin/postgres(_start+0x28) [0x494fb8]
We did not find any other option than deleting the subscription to stop that
loop and start a new one (thus loosing transactions).
The publisher is PostgreSQL 15.6
The subscriber is PostgreSQL 14.5
Thanks