Shreenidhi created HIVE-29644:
---------------------------------
Summary: HMS hang/deadlock during ACID replication: compaction
enqueue incorrectly runs inside replTableWriteIdState transaction
Key: HIVE-29644
URL: https://issues.apache.org/jira/browse/HIVE-29644
Project: Hive
Issue Type: Bug
Reporter: Shreenidhi
h3. Problem
During large Hive ACID bootstrap replication on the target (DR) cluster, HMS
can become unresponsive. Queries stall at compile time waiting to open
transactions. Recovery requires an HMS restart.
Postgres {{pg_stat_activity}} shows multiple {{idle in transaction}}
connections on:
* {{AUX_TABLE}} ({{SELECT ... FOR UPDATE}} for the {{CompactionScheduler}}
mutex)
* {{COMPACTION_QUEUE}} / {{NEXT_COMPACTION_QUEUE_ID}}
HMS logs show cross-node blocking between:
* HMS running replication ({{ReplTableWriteIdStateFunction}} /
{{repl_tbl_writeid_state}})
* HMS running compaction initiator ({{CompactFunction}} via
{{TxnHandler.compact}})
----
h3. Root cause
When replication applies ACID write-ID state for tables with aborted write IDs,
HMS schedules major compaction for each partition to clean aborted delta files.
Before HIVE-27481, {{TxnHandler.replTableWriteIdState}} worked correctly:
# Apply write-ID state in one DB transaction
# Commit
# Call separate {{compact()}} per partition (each with its own transaction)
After HIVE-27481 ({{TxnHandler cleanup}}), the logic moved to
{{ReplTableWriteIdStateFunction}}, which runs inside a single
{{@Transactional(POOL_TX)}} method.
Compaction enqueue via {{CompactFunction}} was incorrectly inlined in the same
transaction as the write-ID apply:
@Transactional(POOL_TX) replTableWriteIdState()
  ├── apply aborted write IDs, insert NEXT_WRITE_ID
  ├── for each partition:
  │     CompactFunction.execute()   // mutex (POOL_MUTEX) + NCQ lock (POOL_TX)
  └── commit (only at end)
This causes:
* {{NEXT_COMPACTION_QUEUE_ID}} row lock held across all partition enqueues in
one long transaction
* Repeated acquisition of {{CompactionScheduler}} mutex across loop iterations
* Cross-connection lock contention / AB-BA deadlock with concurrent
{{compact()}} (initiator, another replication job, or manual compact)
Manual {{ALTER TABLE ... COMPACT 'major'}} does not exhibit this because each
{{compact()}} is a separate {{@Transactional(POOL_TX)}} call that commits
immediately — same as pre-HIVE-27481 behavior.
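The difference in lock hold time can be illustrated with a toy model. This is only a sketch, not HMS code: {{ToyTx}}, {{inlined}} and {{perCall}} are hypothetical names, and the model assumes only that a row lock taken with {{SELECT ... FOR UPDATE}} is held until the enclosing transaction commits. It records how many enqueues each hold of the {{NEXT_COMPACTION_QUEUE_ID}} row lock spans.

```java
import java.util.ArrayList;
import java.util.List;

class LockSpanSketch {
    /** Toy transaction (hypothetical): a row lock, once taken, is held until commit(). */
    static final class ToyTx {
        private boolean holdsNcqRowLock = false;
        private int enqueuesUnderCurrentHold = 0;
        private final List<Integer> holdSpans;

        ToyTx(List<Integer> holdSpans) {
            this.holdSpans = holdSpans;
        }

        /** Models the FOR UPDATE on the queue-id row during one compaction enqueue. */
        void enqueueCompaction(String partition) {
            holdsNcqRowLock = true;
            enqueuesUnderCurrentHold++;
        }

        /** Commit releases the row lock and records how many enqueues it covered. */
        void commit() {
            if (holdsNcqRowLock) {
                holdSpans.add(enqueuesUnderCurrentHold);
            }
            holdsNcqRowLock = false;
            enqueuesUnderCurrentHold = 0;
        }
    }

    /** Shape after the refactoring: every enqueue shares one transaction. */
    static List<Integer> inlined(List<String> partitions) {
        List<Integer> spans = new ArrayList<>();
        ToyTx tx = new ToyTx(spans);
        for (String p : partitions) {
            tx.enqueueCompaction(p);
        }
        tx.commit();
        return spans;
    }

    /** Shape of separate compact() calls: one transaction per partition. */
    static List<Integer> perCall(List<String> partitions) {
        List<Integer> spans = new ArrayList<>();
        for (String p : partitions) {
            ToyTx tx = new ToyTx(spans);
            tx.enqueueCompaction(p);
            tx.commit();
        }
        return spans;
    }
}
```

For three partitions, {{inlined}} returns {{[3]}} (one hold covering all enqueues) and {{perCall}} returns {{[1, 1, 1]}} (three short holds).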
----
h3. Locking details
Compaction enqueue uses two DB connections:
||Connection||Lock||Purpose||
|POOL_MUTEX|{{AUX_TABLE}} CompactionScheduler|Serialize compaction scheduling|
|POOL_TX|{{NEXT_COMPACTION_QUEUE_ID}} FOR UPDATE|Generate unique compaction queue ID|
Deadlock/contention occurs when:
* Thread A holds the NCQ ({{NEXT_COMPACTION_QUEUE_ID}}) lock in a long
replication transaction and waits for the mutex (next partition iteration)
* Thread B holds the mutex (inside {{CompactFunction}}) and waits for the NCQ
lock
Disabling compactor initiator on DR reduces but does not eliminate risk —
concurrent replication jobs alone can trigger the same pattern.
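The AB-BA pattern above can be reproduced in isolation. The sketch below is not HMS code: two in-process {{ReentrantLock}} objects stand in for the two database locks, and the class and method names are hypothetical. Each thread takes its first lock, waits until the other thread holds its own first lock, then tries the opposite lock with a timeout. Both attempts time out, which is the blocked state that shows up in the database as {{idle in transaction}} connections.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

class AbBaSketch {
    /** Returns true if both threads timed out waiting for the other's lock. */
    static boolean bothBlocked() {
        // Stand-ins (assumptions) for the two database locks.
        ReentrantLock ncqRowLock = new ReentrantLock();      // NEXT_COMPACTION_QUEUE_ID FOR UPDATE
        ReentrantLock schedulerMutex = new ReentrantLock();  // AUX_TABLE CompactionScheduler
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicBoolean aGotSecond = new AtomicBoolean(false);
        AtomicBoolean bGotSecond = new AtomicBoolean(false);

        // Thread A: NCQ lock first, then mutex (long replication transaction).
        Thread a = new Thread(() ->
            contend(ncqRowLock, schedulerMutex, bothHoldFirst, bothTried, aGotSecond));
        // Thread B: mutex first, then NCQ lock (a concurrent compact()).
        Thread b = new Thread(() ->
            contend(schedulerMutex, ncqRowLock, bothHoldFirst, bothTried, bGotSecond));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !aGotSecond.get() && !bGotSecond.get();
    }

    private static void contend(ReentrantLock first, ReentrantLock second,
                                CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                                AtomicBoolean gotSecond) {
        first.lock();
        try {
            // Wait until both threads hold their first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // Bounded wait for the opposite lock; a real deadlock would wait forever.
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            gotSecond.set(acquired);
            if (acquired) {
                second.unlock();
            }
            // Keep the first lock until both attempts have finished.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }
}
```

The timeout is there only so the demonstration terminates. Without it, neither thread would make progress.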
----
h3. Regression introduced by
HIVE-27481 — {{TxnHandler cleanup}} (Dec 2023)
File: {{ReplTableWriteIdStateFunction.java}}: inlined {{CompactFunction}} loop
inside {{@Transactional(POOL_TX)}} {{replTableWriteIdState}}.
Pre-HIVE-27481 code explicitly committed write-ID state first, then called
{{compact()}} separately per partition.
----
h3. Proposed fix
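Restore the pre-HIVE-27481 behavior in the refactored code: apply and commit the
write-ID state in its own transaction, then call {{compact()}} separately for
each partition, each in its own transaction.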
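A minimal sketch of the intended ordering follows. It is an illustration of the control flow only, not a patch: {{TxRunner}} is a hypothetical stand-in for "run this body in its own DB transaction", and the method records a trace instead of touching a database.

```java
import java.util.ArrayList;
import java.util.List;

class SplitCommitSketch {
    /** Hypothetical helper: runs the body in its own transaction. */
    interface TxRunner {
        void inTransaction(String label, Runnable body);
    }

    /** Returns the ordered trace of begin/commit events for the restored flow. */
    static List<String> replTableWriteIdState(List<String> partitionsWithAborts) {
        List<String> trace = new ArrayList<>();
        TxRunner runner = (label, body) -> {
            trace.add("begin " + label);
            body.run();
            trace.add("commit " + label);
        };

        // Step 1: apply the write-ID state and commit before any compaction work.
        runner.inTransaction("write-id-state",
            () -> trace.add("apply aborted write IDs"));

        // Step 2: enqueue compaction per partition, each in a short transaction,
        // so the queue-id row lock is released between partitions.
        for (String p : partitionsWithAborts) {
            runner.inTransaction("compact " + p,
                () -> trace.add("enqueue major compaction for " + p));
        }
        return trace;
    }
}
```

With this ordering, no compaction enqueue starts before the write-ID transaction has committed, and no lock taken for one partition is still held when the next partition is processed.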