Krishen Bhan created HUDI-7308:
----------------------------------
Summary: LockManager::unlock should not call
updateLockHeldTimerMetrics if lockDurationTimer has not been started
Key: HUDI-7308
URL: https://issues.apache.org/jira/browse/HUDI-7308
Project: Apache Hudi
Issue Type: Bug
Components: metrics, multi-writer
Affects Versions: 1.0.0-beta1
Reporter: Krishen Bhan
If an exception is thrown in
org.apache.hudi.client.transaction.lock.LockManager#lock it is possible for
lockDurationTimer in HoodieLockMetrics to be closed by the user before it is
started, which throws and bubbles up a `HoodieException("Timer was not
started")` exception (rather than the actual exception that occurred when
trying to acquire lock). Specifically, this can happen due to following scenario
# The BaseHoodieTableServiceClient calls
`org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering` ,
which in turn calls
`org.apache.hudi.client.transaction.TransactionManager#beginTransaction` within
a `try` block
# During
`org.apache.hudi.client.transaction.TransactionManager#beginTransaction` the
LockManager lock API `org.apache.hudi.client.transaction.lock.LockManager#lock`
is called
# Inside ``org.apache.hudi.client.transaction.lock.LockManager#lock` , the
`java.util.concurrent.locks.Lock#tryLock(long, java.util.concurrent.TimeUnit)`
throws some exception. Because of this exception, the statement
`metrics.updateLockAcquiredMetric();` is not executed. This means that the
`org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called
for the timer HoodieLockMetrics member variable
`org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer`
# The exception in (3) bubbles up back to
`org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering`. Since
this is in a `try` block, the `catch` and `finally` blocks are executed. When
`finally` is executed though,
`org.apache.hudi.client.transaction.TransactionManager#endTransaction` is called
# During
`org.apache.hudi.client.transaction.TransactionManager#endTransaction` the
LockManager unlock API
`org.apache.hudi.client.transaction.lock.LockManager#unlock` is called. During
the execution of `metrics.updateLockHeldTimerMetrics();` , The method
`org.apache.hudi.common.util.HoodieTimer#endTimer` is called for
`org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer`
. This throws an exception ` `HoodieException("Timer was not started")` This
is because the corresponding
`org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called
The issue here is that the caller ('BaseHoodieTableServiceClient` in this case)
should have ended up re-throwing the exception thrown in (3) while trying to
start the transaction in
`org.apache.hudi.client.transaction.TransactionManager#startTransaction`.
Instead though, because the caller safely "cleaned up" by calling
`org.apache.hudi.client.transaction.TransactionManager#endTransaction` (in a
`finally`), the `HoodieException("Timer was not started")` exception was raised
instead, suppressing the exception from (3), which is the actual root cause
issue. Instead, the execution of
`org.apache.hudi.client.transaction.TransactionManager#endTransaction` should
have executed without throwing this additional exception, which would have lead
the caller to throw the exception in (3) before exiting.
Although resolving this would not prevent the overall operation from failing,
it would provide better observability on the actual root cause exception (the
one from (3)).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)