On 25.12.2019 07:03, Kyotaro Horiguchi wrote:
At Tue, 24 Dec 2019 20:12:32 +0300, Alexey Kondratov 
<a.kondra...@postgrespro.ru> wrote in
I dig into the code and it happens because of this if statement:

     /* Update the on disk state when lsn was updated. */
     if (XLogRecPtrIsInvalid(endlsn))
     {
         ReplicationSlotMarkDirty();
         ReplicationSlotsComputeRequiredXmin(false);
         ReplicationSlotsComputeRequiredLSN();
         ReplicationSlotSave();
     }
Yes, it seems just broken.

Attached is a small patch, which fixes this bug. I have tried to
stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))'
and now pg_logical_replication_slot_advance and
pg_physical_replication_slot_advance return InvalidXLogRecPtr if
no-op.

What do you think?
I think we shoudn't change the definition of
pg_*_replication_slot_advance since the result is user-facing.

Yes, that was my main concern too. OK.

The functions return a invalid value only when the slot had the
invalid value and failed to move the position. I think that happens
only for uninitalized slots.

Anyway what we should do there is dirtying the slot when the operation
can be assumed to have been succeeded.

As the result I think what is needed there is just checking if the
returned lsn is equal or larger than moveto. Doen't the following
change work?

-       if (XLogRecPtrIsInvalid(endlsn))
+       if (moveto <= endlsn)

Yep, it helps with physical replication slot persistence after advance, but the whole validation (moveto <= endlsn) does not make sense for me. The value of moveto should be >= than minlsn == confirmed_flush / restart_lsn, while endlsn == retlsn is also always initialized with confirmed_flush / restart_lsn. Thus, your condition seems to be true in any case, even if it was no-op one, which we were intended to catch.

Actually, if we do not want to change pg_*_replication_slot_advance, we can just add straightforward validation that either confirmed_flush, or restart_lsn changed after slot advance guts execution. It will be a little bit bulky, but much more clear and will never be affected by pg_*_replication_slot_advance logic change.


Another weird part I have found is this assignment inside pg_logical_replication_slot_advance:

/* Initialize our return value in case we don't do anything */
retlsn = MyReplicationSlot->data.confirmed_flush;

It looks redundant, since later we do the same assignment, which should be reachable in any case.

I will recheck everything again and try to come up with something during this week.


Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Reply via email to