[
https://issues.apache.org/jira/browse/HUDI-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-7308:
---------------------------------
Labels: pull-request-available (was: )
> LockManager::unlock should not call updateLockHeldTimerMetrics if
> lockDurationTimer has not been started
> --------------------------------------------------------------------------------------------------------
>
> Key: HUDI-7308
> URL: https://issues.apache.org/jira/browse/HUDI-7308
> Project: Apache Hudi
> Issue Type: Bug
> Components: metrics, multi-writer
> Affects Versions: 1.0.0-beta1
> Reporter: Krishen Bhan
> Priority: Trivial
> Labels: pull-request-available
>
> If an exception is thrown in
> org.apache.hudi.client.transaction.lock.LockManager#lock it is possible for
> lockDurationTimer in HoodieLockMetrics to be closed by the user before it is
> started, which throws and bubbles up a `HoodieException("Timer was not
> started")` exception (rather than the actual exception that occurred when
> trying to acquire lock). Specifically, this can happen due to following
> scenario
> # The BaseHoodieTableServiceClient calls
> `org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering` ,
> which in turn calls
> `org.apache.hudi.client.transaction.TransactionManager#beginTransaction`
> within a `try` block
> # During
> `org.apache.hudi.client.transaction.TransactionManager#beginTransaction` the
> LockManager lock API
> `org.apache.hudi.client.transaction.lock.LockManager#lock` is called
> # Inside ``org.apache.hudi.client.transaction.lock.LockManager#lock` , the
> `java.util.concurrent.locks.Lock#tryLock(long,
> java.util.concurrent.TimeUnit)` throws some exception. Because of this
> exception, the statement
> `metrics.updateLockAcquiredMetric();` is not executed. This means that the
> `org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called
> for the timer HoodieLockMetrics member variable
> `org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer`
>
> # The exception in (3) bubbles up back to
> `org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering`.
> Since this is in a `try` block, the `catch` and `finally` blocks are
> executed. When `finally` is executed though,
> `org.apache.hudi.client.transaction.TransactionManager#endTransaction` is
> called
> # During
> `org.apache.hudi.client.transaction.TransactionManager#endTransaction` the
> LockManager unlock API
> `org.apache.hudi.client.transaction.lock.LockManager#unlock` is called.
> During the execution of `metrics.updateLockHeldTimerMetrics();` , The
> method `org.apache.hudi.common.util.HoodieTimer#endTimer` is called for
> `org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer`
> . This throws an exception ` `HoodieException("Timer was not started")` This
> is because the corresponding
> `org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called
> The issue here is that the caller ('BaseHoodieTableServiceClient` in this
> case) should have ended up re-throwing the exception thrown in (3) while
> trying to start the transaction in
> `org.apache.hudi.client.transaction.TransactionManager#startTransaction`.
> Instead though, because the caller safely "cleaned up" by calling
> `org.apache.hudi.client.transaction.TransactionManager#endTransaction` (in a
> `finally`), the `HoodieException("Timer was not started")` exception was
> raised instead, suppressing the exception from (3), which is the actual root
> cause issue. Instead, the execution of
> `org.apache.hudi.client.transaction.TransactionManager#endTransaction` should
> have executed without throwing this additional exception, which would have
> lead the caller to throw the exception in (3) before exiting.
> Although resolving this would not prevent the overall operation from failing,
> it would provide better observability on the actual root cause exception (the
> one from (3)).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)