On Wed, 3 Apr 2019 at 19:57, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> Oops, it was my own change that caused the hang. Sorry for the noise.
> After using wal_debug, found out that after replaying the LOCK records
> for the catalog pg_auth, it was not releasing it because it had
> actually got stuck in ReplicationSlotDropPtr() itself. In
> ResolveRecoveryConflictWithSlots(), a shared
> ReplicationSlotControlLock was already held before iterating through
> the slots, and now ReplicationSlotDropPtr() again tries to take the
> same lock in exclusive mode for setting slot->in_use, leading to a
> deadlock. I fixed that by releasing the shared lock before calling
> ReplicationSlotDropPtr(), and then re-starting the slots' scan over
> again since we released it. We do similar thing for
> ReplicationSlotCleanup().
>
> Attached is a rebased version of your patch
> logical-decoding-on-standby.patch. This v2 version also has the above
> changes. It also includes the tap test file which is still in WIP
> state, mainly because I have yet to add the conflict recovery handling
> scenarios.

Attached v3 patch includes a new scenario to test conflict recovery
handling by verifying that the conflicting slot gets dropped.

With this, I am done with the test changes, except for the question
below, which I had posted earlier and on which I would like some input:

Regarding the test failures, I could see that when we drop a logical
replication slot on the standby server, the catalog_xmin of the
physical replication slot becomes NULL, whereas the test expects it to
be equal to xmin. That is the reason a couple of test scenarios are
failing:

ok 33 - slot on standby dropped manually
Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
done
not ok 34 - physical catalog_xmin still non-null
not ok 35 - xmin and catalog_xmin equal after slot drop
#   Failed test 'xmin and catalog_xmin equal after slot drop'
#   at t/016_logical_decoding_on_replica.pl line 272.
#          got:
#     expected: 2584

I am not sure what is expected. What actually happens is: the physical
slot's catalog_xmin remains NULL initially, but becomes non-NULL after
the logical replication slot is created on the standby.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

logical-decoding-on-standby_v3.patch
Description: Binary data