Shreenidhi created HIVE-29644:
---------------------------------

             Summary: HMS hang/deadlock during ACID replication: compaction 
enqueue incorrectly runs inside replTableWriteIdState transaction
                 Key: HIVE-29644
                 URL: https://issues.apache.org/jira/browse/HIVE-29644
             Project: Hive
          Issue Type: Bug
            Reporter: Shreenidhi


h3. Problem

During large Hive ACID bootstrap replication on the target (DR) cluster, HMS 
can become unresponsive. Queries stall at compile time while waiting to open 
transactions, and only an HMS restart restores service.

Postgres {{pg_stat_activity}} shows multiple {{idle in transaction}} 
connections on:
 * {{AUX_TABLE}} ({{SELECT ... FOR UPDATE}} for the {{CompactionScheduler}} 
mutex)
 * {{COMPACTION_QUEUE}} / {{NEXT_COMPACTION_QUEUE_ID}}

HMS logs show cross-node blocking between:
 * HMS running replication ({{ReplTableWriteIdStateFunction}} / 
{{repl_tbl_writeid_state}})
 * HMS running compaction initiator ({{CompactFunction}} via 
{{TxnHandler.compact}})

----
h3. Root cause

When replication applies ACID write-ID state for tables with aborted write IDs, 
HMS schedules major compaction for each partition to clean aborted delta files.

Before HIVE-27481, {{TxnHandler.replTableWriteIdState}} worked correctly:
 # Apply write-ID state in one DB transaction
 # Commit
 # Call separate {{compact()}} per partition (each with its own transaction)

After HIVE-27481 ({{TxnHandler cleanup}}), this logic moved to 
{{ReplTableWriteIdStateFunction}}, inside a single 
{{@Transactional(POOL_TX)}} method.

Compaction enqueue via {{CompactFunction}} was incorrectly inlined into the 
same transaction as the write-ID apply:
@Transactional(POOL_TX) replTableWriteIdState()
├── apply aborted write IDs, insert NEXT_WRITE_ID
├── for each partition:
│     CompactFunction.execute() // mutex (POOL_MUTEX) + NEXT_COMPACTION_QUEUE_ID (NCQ) lock (POOL_TX)
└── commit (only at end)

This causes:
 * {{NEXT_COMPACTION_QUEUE_ID}} row lock held across all partition enqueues in 
one long transaction
 * Repeated acquisition of {{CompactionScheduler}} mutex across loop iterations
 * Cross-connection lock contention / AB-BA deadlock with concurrent 
{{compact()}} (initiator, another replication job, or manual compact)

Manual {{ALTER TABLE ... COMPACT 'major'}} does not exhibit this because each 
{{compact()}} is a separate {{@Transactional(POOL_TX)}} call that commits 
immediately — same as pre-HIVE-27481 behavior.
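
The effect of the inlined loop on lock scope can be illustrated with a toy model. This is a sketch, not HMS code: {{ToyTx}} and {{riskyMutexRequests}} are hypothetical names, and it assumes (as the description above implies) that the NCQ row lock is held until the enclosing transaction commits, while the mutex is requested anew for each partition.

```java
import java.util.HashSet;
import java.util.Set;

public class LockScopeSketch {
    /** Toy transaction: row locks accumulate until commit(), as with SELECT ... FOR UPDATE. */
    static final class ToyTx {
        private final Set<String> held = new HashSet<>();
        void lock(String row) { held.add(row); }
        boolean holds(String row) { return held.contains(row); }
        void commit() { held.clear(); }
    }

    /** Counts mutex requests that are made while the NCQ row lock is still held. */
    public static int riskyMutexRequests(int partitionCount, boolean commitPerPartition) {
        ToyTx tx = new ToyTx();
        int risky = 0;
        for (int i = 0; i < partitionCount; i++) {
            // The enqueue first asks for the CompactionScheduler mutex (assumed order).
            if (tx.holds("NCQ")) {
                risky++; // holding NCQ while waiting for the mutex: one half of an AB-BA cycle
            }
            tx.lock("NCQ"); // then takes the NEXT_COMPACTION_QUEUE_ID row lock
            if (commitPerPartition) {
                tx.commit(); // separate transaction per compact() call
            }
        }
        tx.commit(); // single commit at the end of the inlined variant
        return risky;
    }

    public static void main(String[] args) {
        System.out.println("inlined loop:         " + riskyMutexRequests(3, false));
        System.out.println("commit per partition: " + riskyMutexRequests(3, true));
    }
}
```

In this model, every partition after the first requests the mutex while NCQ is already held when the loop runs in one transaction; with a commit per partition that never happens.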
----
h3. Locking details

Compaction enqueue uses two DB connections:
||Connection||Lock||Purpose||
|POOL_MUTEX|{{AUX_TABLE}} CompactionScheduler|Serialize compaction scheduling|
|POOL_TX|{{NEXT_COMPACTION_QUEUE_ID}} FOR UPDATE|Generate unique compaction queue ID|

Deadlock/contention occurs when:
 * Thread A holds NCQ lock (long repl txn) and waits for mutex (next partition 
iteration)
 * Thread B holds mutex (inside {{CompactFunction}}) and waits for NCQ lock

Disabling compactor initiator on DR reduces but does not eliminate risk — 
concurrent replication jobs alone can trigger the same pattern.
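
The AB-BA pattern between Thread A and Thread B can be reproduced in isolation. The sketch below is only an analogy: it uses in-process {{ReentrantLock}} objects as stand-ins for the two database row locks, and the class and method names are hypothetical. Timed {{tryLock}} calls are used so the demonstration terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class AbBaLockSketch {
    /** Returns true if both threads timed out waiting for the lock the other one holds. */
    public static boolean bothBlocked(long timeoutMs) {
        ReentrantLock ncqLock = new ReentrantLock(); // stand-in for the NEXT_COMPACTION_QUEUE_ID row lock
        ReentrantLock mutex = new ReentrantLock();   // stand-in for the AUX_TABLE CompactionScheduler mutex
        CountDownLatch firstHeld = new CountDownLatch(2);
        CountDownLatch attempted = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];

        // Thread A: holds NCQ, wants the mutex. Thread B: holds the mutex, wants NCQ.
        Thread a = new Thread(() -> holdThenTry(ncqLock, mutex, firstHeld, attempted, gotSecond, 0, timeoutMs));
        Thread b = new Thread(() -> holdThenTry(mutex, ncqLock, firstHeld, attempted, gotSecond, 1, timeoutMs));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !gotSecond[0] && !gotSecond[1];
    }

    private static void holdThenTry(ReentrantLock first, ReentrantLock second,
                                    CountDownLatch firstHeld, CountDownLatch attempted,
                                    boolean[] result, int idx, long timeoutMs) {
        first.lock();
        try {
            firstHeld.countDown();
            firstHeld.await(); // wait until both threads hold their first lock
            boolean got = second.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            result[idx] = got;
            if (got) {
                second.unlock();
            }
            attempted.countDown();
            attempted.await(); // keep the first lock until both attempts have finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("both threads blocked: " + bothBlocked(200));
    }
}
```

With untimed waits, as in the database case described here, neither side would ever give up, which matches the observed need to restart HMS.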
----
h3. Regression introduced by

HIVE-27481 — {{TxnHandler cleanup}} (Dec 2023)
File: {{ReplTableWriteIdStateFunction.java}}, which inlined the {{CompactFunction}} 
loop inside the {{@Transactional(POOL_TX)}} {{replTableWriteIdState}} method.

Pre-HIVE-27481 code explicitly committed write-ID state first, then called 
{{compact()}} separately per partition.
----
h3. Proposed fix

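Restore the pre-HIVE-27481 behavior in the refactored code: commit the write-ID 
state first, then call {{compact()}} separately for each partition, each in its 
own transaction.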
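
A minimal sketch of the transaction ordering this would restore, based on the pre-HIVE-27481 steps listed under Root cause. It is not the actual patch: {{CommitThenCompactSketch}} and {{inOwnTransaction}} are hypothetical, and the transaction boundaries are only recorded as events rather than executed against a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommitThenCompactSketch {
    private final List<String> events = new ArrayList<>();

    /** Hypothetical transaction boundary: records begin and commit around the work. */
    private void inOwnTransaction(String name, Runnable work) {
        events.add("begin " + name);
        work.run();
        events.add("commit " + name);
    }

    /** Applies write-ID state in one transaction, then enqueues compaction per partition. */
    public static List<String> replTableWriteIdState(List<String> partitions) {
        CommitThenCompactSketch tx = new CommitThenCompactSketch();
        tx.inOwnTransaction("write-id-state", () -> {
            // apply aborted write IDs, insert NEXT_WRITE_ID
        });
        for (String partition : partitions) {
            tx.inOwnTransaction("compact " + partition, () -> {
                // take mutex and NCQ lock; both are released when this transaction commits
            });
        }
        return tx.events;
    }

    public static void main(String[] args) {
        replTableWriteIdState(Arrays.asList("p=1", "p=2")).forEach(System.out::println);
    }
}
```

The point of the ordering is that no compaction enqueue starts before the previous transaction has committed, so the NCQ row lock is never carried into the next mutex request.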



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
