Hi, Vitaly! On Mon, Mar 3, 2025 at 5:12 PM Vitaly Davydov <v.davy...@postgrespro.ru> wrote: > The slot data is flushed to the disk at the beginning of checkpoint. If > an existing slot is advanced in the middle of checkpoint execution, its > advanced restart LSN is taken to calculate the oldest LSN for WAL > segments removal at the end of checkpoint. If the node is restarted just > after the checkpoint, the slots data will be read from the disk at > recovery with the oldest restart LSN which can refer to removed WAL > segments. > > The patch introduces a new in-memory state for slots - > flushed_restart_lsn which is used to calculate the oldest LSN for WAL > segments removal. This state is updated every time with the current > restart_lsn at the moment, when the slot is saving to disk.
Thank you for your work on this subject. I think generally your approach is correct. When we're truncating the WAL log, we need to reply on the position that would be used in the case of server crush. That is the position flushed to the disk. While your patch is generality looks good, I'd like make following notes: 1) As ReplicationSlotsComputeRequiredLSN() is called each time we need to advance the position of WAL needed by replication slots, the usage pattern probably could be changed. Thus, we probably need to call ReplicationSlotsComputeRequiredLSN() somewhere after change of restart_lsn_flushed while restart_lsn is not changed. And probably can skip ReplicationSlotsComputeRequiredLSN() in some cases when only restart_lsn is changed. 2) I think it's essential to include into the patch test caches which fail without patch. You could start from integrating [1] test into your patch, and then add more similar tests for different situations. Links. 1. https://www.postgresql.org/message-id/e3ac0535-e7a2-4a96-9b36-9f765e9cfec5%40vondra.me ------ Regards, Alexander Korotkov Supabase