From d9d97f1179f46b1c3763ed9339d8c75019583d55 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 8 Dec 2023 17:07:51 +0530
Subject: [PATCH v3] Fix an undetected deadlock due to apply worker.

The apply worker needs to update the state of the subscription tables to
'READY' during the synchronization phase which requires locking the
corresponding subscription. The apply worker also waits for the
subscription tables to reach the 'SYNCDONE' state after holding the locks
on the subscription and the wait is done using WaitLatch. The 'SYNCDONE'
state is changed by tablesync workers again by locking the corresponding
subscription. Both the state updates use AccessShareLock mode to lock the
subscription, so they can't block each other. However, a backend can
simultaneously try to acquire a lock on the same subscription using
AccessExclusiveLock mode to alter the subscription. Now, the backend's
wait on a lock can sneak in between the apply worker and table sync worker
causing deadlock.

In other words, apply_worker waits for tablesync worker which waits for
backend, and backend waits for apply worker. This is not detected by the
deadlock detector because apply worker uses WaitLatch.

The fix is to release existing locks in apply worker before it starts to
wait for tablesync worker to change the state.

Reported-by: Tomas Vondra
Author: Shlok Kyal
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/d291bb50-12c4-e8af-2af2-7bb9bb4d8e3e@enterprisedb.com
---
 src/backend/replication/logical/tablesync.c | 23 ++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index df3c42eb5d..2bdc1bab55 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -541,16 +541,29 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					/* Now safe to release the LWLock */
 					LWLockRelease(LogicalRepWorkerLock);
 
-					/*
-					 * Enter busy loop and wait for synchronization worker to
-					 * reach expected state (or die trying).
-					 */
-					if (!started_tx)
+					if (started_tx)
+					{
+						/*
+						 * We must commit the existing transaction to release
+						 * the existing locks before entering a busy loop. This
+						 * is required to avoid any undetected deadlocks due to
+						 * any existing lock as deadlock detector won't be able
+						 * to detect the waits on the latch.
+						 */
+						CommitTransactionCommand();
+						pgstat_report_stat(false);
+						StartTransactionCommand();
+					}
+					else
 					{
 						StartTransactionCommand();
 						started_tx = true;
 					}
 
+					/*
+					 * Enter busy loop and wait for synchronization worker to
+					 * reach expected state (or die trying).
+					 */
 					wait_for_relation_state_change(rstate->relid,
 												   SUBREL_STATE_SYNCDONE);
 				}
-- 
2.28.0.windows.1

