Manoj Govindassamy created HUDI-3029:
----------------------------------------
Summary: TransactionManager synchronized begin/endTransaction()
leading to deadlock
Key: HUDI-3029
URL: https://issues.apache.org/jira/browse/HUDI-3029
Project: Apache Hudi
Issue Type: Task
Components: Writer Core
Reporter: Manoj Govindassamy
Assignee: Manoj Govindassamy
Fix For: 0.11.0
I see that TransactionManager declares beginTransaction() and endTransaction()
as synchronized methods. Depending on the lock provider implementation, this
can have adverse effects. If the lock provider's lock() or tryLock() is a
blocking call (which is generally the case), the following sequence leads to a
deadlock.
Client 1: beginTransaction() => txn manager instance lock acquired, lock()
went through, instance lock released
Client 2: beginTransaction() => txn manager instance lock acquired, lock() is
blocking
Client 3: endTransaction() => Waiting to lock the txn manager instance to enter
the synchronized method
{noformat}
public synchronized void beginTransaction(Option<HoodieInstant> currentTxnOwnerInstant,
                                          Option<HoodieInstant> lastCompletedTxnOwnerInstant) {
  if (supportsOptimisticConcurrency) {
    this.lastCompletedTxnOwnerInstant = lastCompletedTxnOwnerInstant;
    lockManager.setLatestCompletedWriteInstant(lastCompletedTxnOwnerInstant);
    LOG.info("Latest completed transaction instant " + lastCompletedTxnOwnerInstant);
    this.currentTxnOwnerInstant = currentTxnOwnerInstant;
    LOG.info("Transaction starting with transaction owner " + currentTxnOwnerInstant);
    lockManager.lock();
    LOG.info("Transaction started");
  }
}

public synchronized void endTransaction() {
  if (supportsOptimisticConcurrency) {
    LOG.info("Transaction ending with transaction owner " + currentTxnOwnerInstant);
    lockManager.unlock();
    LOG.info("Transaction ended");
    this.lastCompletedTxnOwnerInstant = Option.empty();
    lockManager.resetLatestCompletedWriteInstant();
  }
}
{noformat}
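A minimal, self-contained sketch of the sequence above, in case it helps to reproduce the hang outside Hudi. Everything here is hypothetical: BlockingLockManager stands in for a lock provider with a blocking lock(), and SynchronizedTxnManager copies only the synchronized shape of the two methods, not their bookkeeping. The demo interrupts the blocked caller at the end so that it terminates.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class SynchronizedTxnDeadlockDemo {

  // Hypothetical stand-in for a lock provider whose lock() blocks until the lock is free.
  static class BlockingLockManager {
    private final Semaphore permit = new Semaphore(1);
    void lock() throws InterruptedException { permit.acquire(); }
    void unlock() { permit.release(); }
  }

  // Mirrors only the synchronized shape of the quoted methods.
  static class SynchronizedTxnManager {
    private final BlockingLockManager lockManager = new BlockingLockManager();
    synchronized void beginTransaction() throws InterruptedException { lockManager.lock(); }
    synchronized void endTransaction() { lockManager.unlock(); }
  }

  // Polls (for at most 5 seconds) until the thread is parked inside lock().
  static void awaitParked(Thread t) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
    while (t.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
      Thread.sleep(10);
    }
  }

  // Returns true if endTransaction() could not enter the synchronized method in time.
  static boolean endTransactionGetsStuck() {
    try {
      SynchronizedTxnManager txn = new SynchronizedTxnManager();
      txn.beginTransaction(); // Client 1: lock() succeeds, monitor released on return

      Thread client2 = new Thread(() -> {
        try {
          txn.beginTransaction(); // Client 2: holds the monitor while blocked in lock()
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      client2.setDaemon(true);
      client2.start();
      awaitParked(client2);

      Thread ender = new Thread(txn::endTransaction); // needs the monitor client 2 holds
      ender.setDaemon(true);
      ender.start();
      ender.join(500);
      boolean stuck = ender.isAlive();

      client2.interrupt(); // break the cycle so the demo can finish
      client2.join(1000);
      ender.join(1000);
      return stuck;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("endTransaction() stuck: " + endTransactionGetsStuck());
  }
}
```

The semaphore is used instead of a ReentrantLock only so that unlock() may be called from a thread other than the one that called lock().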
The current model probably works only because the lock provider's tryLock()
implementation sleeps or retries with a timeout, so the blocked caller
eventually leaves the synchronized method. But the transaction manager layer
can't make assumptions about the lock provider implementation.
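One possible direction, sketched under assumptions and not the agreed fix: drop synchronized from the two methods so the blocking lock() call never runs while the instance monitor is held, and update the owner bookkeeping only while the lock is held. The class and field names below are hypothetical, and a String stands in for Option<HoodieInstant>.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class UnsynchronizedTxnSketch {

  // Hypothetical stand-in for a lock provider whose lock() blocks until the lock is free.
  static class BlockingLockManager {
    private final Semaphore permit = new Semaphore(1);
    void lock() throws InterruptedException { permit.acquire(); }
    void unlock() { permit.release(); }
  }

  static class TxnManager {
    private final BlockingLockManager lockManager = new BlockingLockManager();
    private volatile String currentOwner; // stand-in for the current txn owner instant

    void beginTransaction(String owner) throws InterruptedException {
      lockManager.lock();   // may block, but no instance monitor is held meanwhile
      currentOwner = owner; // bookkeeping happens only once the lock is held
    }

    void endTransaction() {
      currentOwner = null;  // still holding the lock here
      lockManager.unlock();
    }
  }

  // Returns true if endTransaction() finishes and the waiting client then gets the lock.
  static boolean endTransactionCompletes() {
    try {
      TxnManager txn = new TxnManager();
      txn.beginTransaction("client-1");

      Thread client2 = new Thread(() -> {
        try {
          txn.beginTransaction("client-2"); // blocks in lock() until client 1 ends
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      client2.setDaemon(true);
      client2.start();

      // Poll (for at most 5 seconds) until client 2 is parked inside lock().
      long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
      while (client2.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
        Thread.sleep(10);
      }

      Thread ender = new Thread(txn::endTransaction);
      ender.setDaemon(true);
      ender.start();
      ender.join(2000);
      boolean ended = !ender.isAlive();

      client2.join(2000);
      return ended && !client2.isAlive() && "client-2".equals(txn.currentOwner);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("endTransaction() completed: " + endTransactionCompletes());
  }
}
```

With this shape, mutual exclusion comes from the lock provider alone, so it no longer matters whether lock() blocks, sleeps, or retries.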
cc: [~nishith29] [~shivnarayan]