From 549ebeac998a0f81d0b2a99191db84e153dcc582 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 24 Nov 2023 16:37:14 +0530
Subject: [PATCH v1] Deadlock when apply worker,tablesync worker and client
 backend run parallelly.

Apply worker holds a lock on the table pg_subscription_rel and waits for
notification from the tablesync workers where the relation is synced, which
happens through updating the pg_subscription_rel row. And that wait happens in
wait_for_relation_state_change, which simply checks the row in a loop, with a
sleep by WaitLatch(). Unfortunately, the tablesync workers can't update the row
because the client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in, and waits for an AccessExclusiveLock. So the tablesync workers are
stuck in the queue and can't proceed.

The client backend is waiting for a lock held by the apply worker. The tablesync
workers can't proceed because their lock request is stuck behind the
AccessExclusiveLock request. And the apply worker is waiting for status update
from the tablesync workers. And the deadlock is undetected, because the apply
worker is not waiting on a lock, but sleeping on a latch.

We have resolved the issue by releasing the lock by commiting the transaction
before the apply worker goes to wait state.
---
 src/backend/replication/logical/tablesync.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index df3c42eb5d..e2b3ce64cb 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -542,15 +542,28 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					LWLockRelease(LogicalRepWorkerLock);
 
 					/*
-					 * Enter busy loop and wait for synchronization worker to
-					 * reach expected state (or die trying).
+					 * If we have a transaction, we must commit it to release
+					 * any locks we have (it's quite likely we hold lock on
+					 * pg_replication_origin, which the sync worker will need
+					 * to update).  Then start a new transaction so we can
+					 * examine catalog state.
 					 */
-					if (!started_tx)
+					if (started_tx)
+					{
+						CommitTransactionCommand();
+						pgstat_report_stat(false);
+						StartTransactionCommand();
+					}
+					else
 					{
 						StartTransactionCommand();
 						started_tx = true;
 					}
 
+					/*
+					 * Enter busy loop and wait for synchronization worker to
+					 * reach expected state (or die trying).
+					 */
 					wait_for_relation_state_change(rstate->relid,
 												   SUBREL_STATE_SYNCDONE);
 				}
-- 
2.34.1

