[jira] [Commented] (HIVE-28565) Reduce lock.sleep.duration.between.retries for tests

Stamatis Zampetakis (Jira) Thu, 10 Oct 2024 06:54:05 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-28565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888294#comment-17888294
 ]


Stamatis Zampetakis commented on HIVE-28565:
--------------------------------------------

In fact, this was already done in HIVE-6464, but the configuration files have 
multiplied since then, so the PR aims to cover the cases where the properties 
are not set.

> Reduce lock.sleep.duration.between.retries for tests
> ----------------------------------------------------
>
>                 Key: HIVE-28565
>                 URL: https://issues.apache.org/jira/browse/HIVE-28565
>             Project: Hive
>          Issue Type: Task
>      Security Level: Public(Viewable by anyone) 
>          Components: Tests
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>
> The default value of the hive.lock.numretries/metastore.lock.numretries 
> property is 100. Combined with the hive.lock.sleep.between.retries property 
> set to 60s, this can keep a test running and retrying for ~1.7 hours (6000s).
> In normal circumstances tests should obtain a lock rapidly, but if something 
> goes wrong then waiting ~1.7 hours just to see the test fail is unacceptable. 
> I've hit this situation a couple of times, most recently in 
> https://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-5249/12/tests/
>  where 
> TestCrudCompactorOnTez#testRebalanceCompactionWithParallelDeleteAsSecondPessimisticLock
>  kept running for 1h 46m.
> {noformat}
> 2024-10-08T14:36:15,954 ERROR [main] lockmgr.DbLockManager: Unable to acquire 
> locks for lockId=19 after 101 retries (retries took 6343541 ms). 
> QueryId=jenkins_20241008124941_1f2b0bba-f6e6-4def-b8d9-41f4ff318854
> LockResponse(lockid:19, state:WAITING)
> FAILED: Error in acquiring locks: Lock acquisition for 
> LockRequest(component:[LockComponent(type:SHARED_WRITE, level:TABLE, 
> dbname:default, tablename:rebalance_test, operationType:DELETE, 
> isTransactional:true, isDynamicPartitionWrite:false)], txnid:19, 
> user:jenkins, hostname:hive-precommit-pr-5249-12-kztld-624b4-hng5v, 
> agentInfo:jenkins_20241008124941_1f2b0bba-f6e6-4def-b8d9-41f4ff318854, 
> zeroWaitReadEnabled:true, exclusiveCTAS:false) timed out after 6343541ms.  
> LockResponse(lockid:19, state:WAITING)
> 2024-10-08T14:36:15,967 ERROR [main] ql.Driver: FAILED: Error in acquiring 
> locks: Lock acquisition for 
> LockRequest(component:[LockComponent(type:SHARED_WRITE, level:TABLE, 
> dbname:default, tablename:rebalance_test, operationType:DELETE, 
> isTransactional:true, isDynamicPartitionWrite:false)], txnid:19, 
> user:jenkins, hostname:hive-precommit-pr-5249-12-kztld-624b4-hng5v, 
> agentInfo:jenkins_20241008124941_1f2b0bba-f6e6-4def-b8d9-41f4ff318854, 
> zeroWaitReadEnabled:true, exclusiveCTAS:false) timed out after 6343541ms.  
> LockResponse(lockid:19, state:WAITING)
> org.apache.hadoop.hive.ql.lockmgr.LockException: Lock acquisition for 
> LockRequest(component:[LockComponent(type:SHARED_WRITE, level:TABLE, 
> dbname:default, tablename:rebalance_test, operationType:DELETE, 
> isTransactional:true, isDynamicPartitionWrite:false)], txnid:19, 
> user:jenkins, hostname:hive-precommit-pr-5249-12-kztld-624b4-hng5v, 
> agentInfo:jenkins_20241008124941_1f2b0bba-f6e6-4def-b8d9-41f4ff318854, 
> zeroWaitReadEnabled:true, exclusiveCTAS:false) timed out after 6343541ms.  
> LockResponse(lockid:19, state:WAITING)
>       at org.apache.hadoop.hive.ql.lockmgr.DbLockManager.lock(DbLockManager.java:155)
>       at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocks(DbTxnManager.java:464)
>       at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocksWithHeartbeatDelay(DbTxnManager.java:498)
>       at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocks(DbTxnManager.java:295)
>       at org.apache.hadoop.hive.ql.lockmgr.HiveTxnManagerImpl.acquireLocks(HiveTxnManagerImpl.java:81)
>       at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocks(DbTxnManager.java:100)
>       at org.apache.hadoop.hive.ql.DriverTxnHandler.acquireLocksInternal(DriverTxnHandler.java:338)
>       at org.apache.hadoop.hive.ql.DriverTxnHandler.acquireLocks(DriverTxnHandler.java:240)
>       at org.apache.hadoop.hive.ql.DriverTxnHandler.acquireLocksIfNeeded(DriverTxnHandler.java:147)
>       at org.apache.hadoop.hive.ql.Driver.lockAndRespond(Driver.java:335)
>       at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:179)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:143)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:130)
>       at org.apache.hadoop.hive.ql.txn.compactor.TestCompactorBase.executeStatementOnDriver(TestCompactorBase.java:171)
>       at org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez.testRebalanceCompactionWithParallelDeleteAsSecond(TestCrudCompactorOnTez.java:143)
>       at org.apache.hadoop.hive.ql.txn.compactor.TestCrudCompactorOnTez.testRebalanceCompactionWithParallelDeleteAsSecondPessimisticLock(TestCrudCompactorOnTez.java:102)
> {noformat}
> I propose setting the respective properties to some small value (e.g., 5) 
> when running the tests, so that they fail fast when a lock cannot be obtained 
> and do not waste resources.
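
The arithmetic in the description can be checked with a small sketch. It assumes the worst-case wait is roughly the retry count multiplied by the sleep between retries, and it ignores the time spent on each lock request. The class and method names are hypothetical and not part of Hive. The reduced values are illustrative only, since the description does not say which property gets which value or in which unit.

```java
import java.time.Duration;

// Hypothetical helper, not part of Hive: estimates how long a test can
// keep retrying a lock before it gives up.
public class LockWaitEstimate {

    // Upper bound on time spent sleeping: retries multiplied by the sleep interval.
    static long worstCaseWaitSeconds(int numRetries, long sleepSeconds) {
        return (long) numRetries * sleepSeconds;
    }

    // Average time per retry (integer division), from a "retries took N ms" log line.
    static long averageMillisPerRetry(long totalMillis, int retries) {
        return totalMillis / retries;
    }

    // Formats a number of seconds as "Ns (Hh Mm Ss)".
    static String describe(long seconds) {
        Duration d = Duration.ofSeconds(seconds);
        return seconds + "s (" + d.toHours() + "h "
                + d.toMinutesPart() + "m " + d.toSecondsPart() + "s)";
    }

    public static void main(String[] args) {
        // Values stated in the description: 100 retries, 60s sleep.
        System.out.println("100 retries x 60s: " + describe(worstCaseWaitSeconds(100, 60)));

        // Values from the quoted log: 101 retries took 6343541 ms.
        System.out.println("observed total:    " + describe(6343541L / 1000));
        System.out.println("observed ms/retry: " + averageMillisPerRetry(6343541L, 101));

        // Illustrative reduced settings (assumed, not taken from the PR).
        System.out.println("5 retries x 60s:   " + describe(worstCaseWaitSeconds(5, 60)));
        System.out.println("100 retries x 5s:  " + describe(worstCaseWaitSeconds(100, 5)));
    }
}
```

The observed average of about 62.8 seconds per retry is consistent with a 60s sleep plus the cost of each lock request.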



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
