[ 
https://issues.apache.org/jira/browse/HUDI-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7308:
-----------------------------
    Fix Version/s: 1.0.0

> LockManager::unlock should not call updateLockHeldTimerMetrics if 
> lockDurationTimer has not been started
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-7308
>                 URL: https://issues.apache.org/jira/browse/HUDI-7308
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metrics, multi-writer
>    Affects Versions: 1.0.0-beta1
>            Reporter: Krishen Bhan
>            Priority: Trivial
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>
> If an exception is thrown in 
> org.apache.hudi.client.transaction.lock.LockManager#lock it is possible for 
> lockDurationTimer in HoodieLockMetrics to be closed by the user before it is 
> started, which throws and bubbles up a `HoodieException("Timer was not 
> started")`  exception (rather than the actual exception that occurred when 
> trying to acquire lock). Specifically, this can happen due to following 
> scenario
>  # The BaseHoodieTableServiceClient calls 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering` , 
> which in turn calls 
> `org.apache.hudi.client.transaction.TransactionManager#beginTransaction` 
> within a `try` block
>  # During 
> `org.apache.hudi.client.transaction.TransactionManager#beginTransaction` the 
> LockManager lock API 
> `org.apache.hudi.client.transaction.lock.LockManager#lock` is called
>  # Inside ``org.apache.hudi.client.transaction.lock.LockManager#lock` , the 
> `java.util.concurrent.locks.Lock#tryLock(long, 
> java.util.concurrent.TimeUnit)` throws some exception. Because of this 
> exception, the statement
> `metrics.updateLockAcquiredMetric();` is not executed. This means that the 
> `org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called 
> for the timer HoodieLockMetrics member variable 
> `org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer`
>  
>  # The exception in (3) bubbles up back to 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering`. 
> Since this is in a `try` block, the `catch` and `finally` blocks are 
> executed. When `finally` is executed though, 
> `org.apache.hudi.client.transaction.TransactionManager#endTransaction` is 
> called
>  # During 
> `org.apache.hudi.client.transaction.TransactionManager#endTransaction` the 
> LockManager unlock API 
> `org.apache.hudi.client.transaction.lock.LockManager#unlock` is called. 
> During the  execution of `metrics.updateLockHeldTimerMetrics();`  ,  The 
> method `org.apache.hudi.common.util.HoodieTimer#endTimer` is called for 
> `org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer`
>  . This throws an exception ` `HoodieException("Timer was not started")` This 
> is  because the corresponding 
> `org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called
> The issue here is that the caller ('BaseHoodieTableServiceClient` in this 
> case) should have ended up re-throwing the exception thrown in (3) while 
> trying to start the transaction in 
> `org.apache.hudi.client.transaction.TransactionManager#startTransaction`. 
> Instead though, because the caller safely "cleaned up" by calling 
> `org.apache.hudi.client.transaction.TransactionManager#endTransaction` (in a 
> `finally`), the `HoodieException("Timer was not started")` exception was 
> raised instead, suppressing the exception from (3), which is the actual root 
> cause issue. Instead, the execution of 
> `org.apache.hudi.client.transaction.TransactionManager#endTransaction` should 
> have executed without throwing this additional exception, which would have 
> lead the caller to throw the exception in (3) before exiting.
> Although resolving this would not prevent the overall operation from failing, 
> it would provide better observability on the actual root cause exception (the 
> one from (3)). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to