Hi,

A replication slot can be lost when a subscriber cannot keep up with the
load on the primary and the WAL it still needs exceeds
max_slot_wal_keep_size. When this happens, the target has to be reseeded
from scratch (pg_dump), which can take a long time. I am investigating the
options to revive a lost slot. With the attached patch, and after copying
the WAL files from the archive to the pg_wal directory, I was able to revive
the lost slot. I also verified that a lost slot still prevents vacuum from
cleaning up catalog tuples deleted by any transaction later than
catalog_xmin. One side
effect of this approach is that the checkpointer creates .ready files in
the archive_status directory for the copied WAL files, so the archive
command has to handle this case. At the same time, the checkpointer can
potentially delete a copied file again before the subscriber consumes
it. In the proposed patch, I am not setting restart_lsn
to InvalidXLogRecPtr but instead relying on the invalidated_at field to tell
whether the slot is lost. Was the intent of setting restart_lsn to
InvalidXLogRecPtr to disallow reviving the slot?

If the overall direction seems OK, I will continue working on reviving the
slot by copying the WAL files from the archive. I would appreciate your
feedback.

Thanks,
Sirisha

Attachment: 0001-Allow-revive-a-lost-replication-slot.patch
