Krishen Bhan created HUDI-7308:
----------------------------------

             Summary: LockManager::unlock should not call 
updateLockHeldTimerMetrics if lockDurationTimer has not been started
                 Key: HUDI-7308
                 URL: https://issues.apache.org/jira/browse/HUDI-7308
             Project: Apache Hudi
          Issue Type: Bug
          Components: metrics, multi-writer
    Affects Versions: 1.0.0-beta1
            Reporter: Krishen Bhan


If an exception is thrown in 
org.apache.hudi.client.transaction.lock.LockManager#lock it is possible for 
lockDurationTimer in HoodieLockMetrics to be closed by the user before it is 
started, which throws and bubbles up a `HoodieException("Timer was not 
started")`  exception (rather than the actual exception that occurred when 
trying to acquire lock). Specifically, this can happen due to following scenario
 # The BaseHoodieTableServiceClient calls 
`org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering` , 
which in turn calls 
`org.apache.hudi.client.transaction.TransactionManager#beginTransaction` within 
a `try` block
 # During 
`org.apache.hudi.client.transaction.TransactionManager#beginTransaction` the 
LockManager lock API `org.apache.hudi.client.transaction.lock.LockManager#lock` 
is called
 # Inside ``org.apache.hudi.client.transaction.lock.LockManager#lock` , the 
`java.util.concurrent.locks.Lock#tryLock(long, java.util.concurrent.TimeUnit)` 
throws some exception. Because of this exception, the statement
`metrics.updateLockAcquiredMetric();` is not executed. This means that the 
`org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called 
for the timer HoodieLockMetrics member variable 
`org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer`
 
 # The exception in (3) bubbles up back to 
`org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering`. Since 
this is in a `try` block, the `catch` and `finally` blocks are executed. When 
`finally` is executed though, 
`org.apache.hudi.client.transaction.TransactionManager#endTransaction` is called
 # During 
`org.apache.hudi.client.transaction.TransactionManager#endTransaction` the 
LockManager unlock API 
`org.apache.hudi.client.transaction.lock.LockManager#unlock` is called. During 
the  execution of `metrics.updateLockHeldTimerMetrics();`  ,  The method 
`org.apache.hudi.common.util.HoodieTimer#endTimer` is called for 
`org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer`
 . This throws an exception ` `HoodieException("Timer was not started")` This 
is  because the corresponding 
`org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called

The issue here is that the caller ('BaseHoodieTableServiceClient` in this case) 
should have ended up re-throwing the exception thrown in (3) while trying to 
start the transaction in 
`org.apache.hudi.client.transaction.TransactionManager#startTransaction`. 
Instead though, because the caller safely "cleaned up" by calling 
`org.apache.hudi.client.transaction.TransactionManager#endTransaction` (in a 
`finally`), the `HoodieException("Timer was not started")` exception was raised 
instead, suppressing the exception from (3), which is the actual root cause 
issue. Instead, the execution of 
`org.apache.hudi.client.transaction.TransactionManager#endTransaction` should 
have executed without throwing this additional exception, which would have lead 
the caller to throw the exception in (3) before exiting.

Although resolving this would not prevent the overall operation from failing, 
it would provide better observability on the actual root cause exception (the 
one from (3)). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to