On Mon, Nov 21, 2022 at 4:31 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Sat, Nov 19, 2022 at 6:35 AM Andres Freund <and...@anarazel.de> wrote:
> >
> > On 2022-11-18 11:20:36 +0530, Amit Kapila wrote:
> > > Okay, updated the patch accordingly.
> >
> > Assuming it passes tests etc, this'd work for me.
> >
>
> Thanks, Pushed.

The same assertion failure has been reported on another thread[1]. Since I could reproduce this issue several times in my environment, I've investigated the root cause.

I think there is a race condition between CreateInitDecodingContext() and LogicalConfirmReceivedLocation() when updating procArray->replication_slot_xmin.

What I observed in the test was that a walsender process called:

    SnapBuildProcessRunningXacts()
    LogicalIncreaseXminForSlot()
    LogicalConfirmReceivedLocation()
    ReplicationSlotsComputeRequiredXmin(false)

In ReplicationSlotsComputeRequiredXmin(), it acquired ReplicationSlotControlLock and got 0 as the minimum xmin, since no walsender had an effective_xmin set. Before it called ProcArraySetReplicationSlotXmin() (i.e. before it acquired ProcArrayLock), another walsender process called CreateInitDecodingContext(), acquired ProcArrayLock, computed slot->effective_catalog_xmin, and called ReplicationSlotsComputeRequiredXmin(true). Since its effective_catalog_xmin had been set, it got 39968 as the minimum xmin and updated replication_slot_xmin. However, as soon as the second walsender released ProcArrayLock, the first walsender overwrote replication_slot_xmin with 0. After that, the second walsender called SnapBuildInitialSnapshot(), and GetOldestSafeDecodingTransactionId() returned an XID newer than snap->xmin.

One idea to fix this issue is that in ReplicationSlotsComputeRequiredXmin(), we compute the minimum xmin while holding both ProcArrayLock and ReplicationSlotControlLock, and release only ReplicationSlotControlLock before updating replication_slot_xmin. I'm concerned that it will increase the contention on ProcArrayLock, but I've attached the patch for discussion.

Regards,

[1] https://www.postgresql.org/message-id/tencent_7EB71DA5D7BA00EB0B429DCE45D0452B6406%40qq.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
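
To make the interleaving described above easier to follow, here is a minimal single-threaded sketch that replays the steps in a fixed order. It is only a toy model and not the attached patch or PostgreSQL code: ToySlot, ToyShared and all toy_* functions are hypothetical names, the xmin and catalog xmin are collapsed into a single value, and plain integer comparison stands in for the wraparound-aware XID comparison the server uses. The locks are not modeled; "atomic" here just means that no step of the other walsender is placed between the compute and the publish.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define TOY_MAX_SLOTS 4

/* Hypothetical stand-in for a replication slot (assumption, not the real struct). */
typedef struct ToySlot
{
    int           in_use;
    TransactionId effective_catalog_xmin;
} ToySlot;

/* Hypothetical stand-in for the shared state. */
typedef struct ToyShared
{
    ToySlot       slots[TOY_MAX_SLOTS];
    TransactionId replication_slot_xmin; /* models procArray->replication_slot_xmin */
} ToyShared;

/* Models CreateInitDecodingContext() giving its slot an xmin. */
void toy_set_slot_xmin(ToyShared *s, int idx, TransactionId xmin)
{
    assert(idx >= 0 && idx < TOY_MAX_SLOTS);
    s->slots[idx].in_use = 1;
    s->slots[idx].effective_catalog_xmin = xmin;
}

/* Step 1: scan the slots for the minimum xmin (0 means "none found"). */
TransactionId toy_compute_min_xmin(const ToyShared *s)
{
    TransactionId min = InvalidTransactionId;

    for (int i = 0; i < TOY_MAX_SLOTS; i++)
    {
        TransactionId x = s->slots[i].effective_catalog_xmin;

        if (!s->slots[i].in_use || x == InvalidTransactionId)
            continue;
        if (min == InvalidTransactionId || x < min)
            min = x;
    }
    return min;
}

/* Step 2: publish the computed value to the shared state. */
void toy_publish_xmin(ToyShared *s, TransactionId xmin)
{
    s->replication_slot_xmin = xmin;
}

/* Replays the reported order: A computes, B computes and publishes, A publishes. */
TransactionId toy_racy_interleaving(void)
{
    ToyShared     s = {0};
    TransactionId stale;

    stale = toy_compute_min_xmin(&s);               /* walsender A sees no xmin */
    toy_set_slot_xmin(&s, 0, 39968);                /* walsender B sets its xmin */
    toy_publish_xmin(&s, toy_compute_min_xmin(&s)); /* B publishes 39968 */
    toy_publish_xmin(&s, stale);                    /* A overwrites with stale value */
    return s.replication_slot_xmin;
}

/* Same actors, but each compute+publish pair runs without interruption. */
TransactionId toy_serialized_interleaving(int a_first)
{
    ToyShared s = {0};

    if (a_first)
        toy_publish_xmin(&s, toy_compute_min_xmin(&s)); /* A runs before B */
    toy_set_slot_xmin(&s, 0, 39968);
    toy_publish_xmin(&s, toy_compute_min_xmin(&s));     /* B */
    if (!a_first)
        toy_publish_xmin(&s, toy_compute_min_xmin(&s)); /* A runs after B */
    return s.replication_slot_xmin;
}
```

In this model, toy_racy_interleaving() ends with the shared value back at 0, while toy_serialized_interleaving() ends with 39968 whichever walsender goes first. That is the property the proposed locking change is meant to provide.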
fix_slot_xmin_race_condition.patch
Description: Binary data