[ https://issues.apache.org/jira/browse/HIVE-25663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-25663:
----------------------------------
Labels: pull-request-available (was: )
> Need to modify table/partition lock acquisition retry for Zookeeper lock manager
> --------------------------------------------------------------------------------
>
> Key: HIVE-25663
> URL: https://issues.apache.org/jira/browse/HIVE-25663
> Project: Hive
> Issue Type: Improvement
> Components: Locking
> Reporter: Eugene Chung
> Assignee: Eugene Chung
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2021-10-30-11-54-42-164.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {code:java}
> LOCK TABLE default.my_table PARTITION (log_date='2021-10-30') EXCLUSIVE;
> SET hive.query.timeout.seconds=5;
> SELECT * FROM default.my_table WHERE log_date='2021-10-30' LIMIT 10;
> {code}
> If you execute the three SQLs above in the same session, the last SELECT is
> cancelled with a timeout error. The problem is that, if you are using
> ZooKeeperHiveLockManager, running 'show locks' afterwards shows a SHARED lock
> on default.my_table which remains for 100 minutes.
> !image-2021-10-30-11-54-42-164.png|width=873,height=411!
> Let me explain the problem step by step.
>
> The following SELECT SQL, which reads some data from a partitioned table,
> {code:java}
> SELECT * FROM my_table WHERE log_date='2021-10-30' LIMIT 10{code}
> needs two SHARED locks, acquired in this order:
> * default.my_table
> * default.my_table@log_date=2021-10-30
> Suppose that, before the SQL is executed, an EXCLUSIVE lock on the partition
> already exists. We can easily simulate this with a DDL like the one below:
> {code:java}
> LOCK TABLE default.my_table PARTITION (log_date='2021-10-30') EXCLUSIVE{code}
>
> The SELECT SQL acquires the SHARED lock on the table, but it cannot acquire
> the SHARED lock on the partition. It keeps retrying the acquisition as governed
> by two configurations; with the default values below it retries for 100
> minutes (100 retries x 60s), as the sketch after the list illustrates.
> * hive.lock.sleep.between.retries=60s
> * hive.lock.numretries=100
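>
> A simplified, illustrative sketch of that retry behaviour (this is not the
> actual ZooKeeperHiveLockManager code; the class name and the
> tryCreateLockNode() helper are hypothetical, only the retry/sleep structure
> and the swallowed InterruptedException reflect the issue):
> {code:java}
> // Simplified, illustrative sketch of the lock acquisition retry behaviour.
> // With the defaults, 100 retries x 60s sleep keeps the thread busy for up to
> // ~100 minutes while the conflicting EXCLUSIVE lock is held.
> public class LockRetrySketch {
>
>     // Hypothetical stand-in for creating the lock znode in ZooKeeper;
>     // returns false as long as a conflicting lock exists.
>     static boolean tryCreateLockNode() {
>         return false;
>     }
>
>     static boolean lockWithRetries(int numRetries, long sleepMillis) {
>         for (int attempt = 0; attempt <= numRetries; attempt++) {
>             if (tryCreateLockNode()) {
>                 return true; // lock acquired
>             }
>             try {
>                 Thread.sleep(sleepMillis); // hive.lock.sleep.between.retries
>             } catch (InterruptedException e) {
>                 // The problematic part: the interrupt coming from a cancelled
>                 // or timed-out query is only logged, and the loop goes on.
>                 System.err.println("Interrupted while waiting for the lock, retrying: " + e);
>             }
>         }
>         return false; // gave up after roughly numRetries * sleepMillis
>     }
>
>     public static void main(String[] args) {
>         // hive.lock.numretries=100, hive.lock.sleep.between.retries=60s
>         lockWithRetries(100, 60_000L);
>     }
> }
> {code}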
>
> If hive.query.timeout.seconds is set to 5 seconds, the SELECT SQL is cancelled
> after 5 seconds and the client returns with a timeout error. But the SHARED
> lock on my_table still remains for 100 minutes, because [the current
> ZooKeeperHiveLockManager just logs the
> InterruptedException|https://github.com/apache/hive/blob/8a8e03d02003aa3543f46f595b4425fd8c156ad9/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/zookeeper/ZooKeeperHiveLockManager.java#L326]
> and keeps retrying the lock acquisition. This also means the SQL processing
> thread keeps running for 100 minutes even though the SQL was cancelled. If the
> same SQL is executed 3 times, you can see 3 such threads, each with a thread
> dump like the one below:
> {code:java}
> "HiveServer2-Background-Pool: Thread-154" #154 prio=5 os_prio=0
> tid=0x00007f0ac91cb000 nid=0x13d25 waiting on condition [0x000
> 07f0aa2ce2000]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager.lock(ZooKeeperHiveLockManager.java:303)
> at
> org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager.lock(ZooKeeperHiveLockManager.java:207)
> at
> org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager.acquireLocks(DummyTxnManager.java:199)
> at org.apache.hadoop.hive.ql.Driver.acquireLocks(Driver.java:1610)
> at org.apache.hadoop.hive.ql.Driver.lockAndRespond(Driver.java:1796)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1966)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1710)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1704)
> at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:217)
> at
> org.apache.hive.service.cli.operation.SQLOperation.access$500(SQLOperation.java:87)
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:309)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:322)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
>
> I think ZooKeeperHiveLockManager should not swallow unexpected exceptions such
> as this InterruptedException. It should retry only for the expected ones.
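>
> As a minimal sketch of one possible shape of the change (this is not the
> actual patch; it continues the hypothetical example above), the retry loop
> could let the interrupt stop the acquisition instead of logging it, so the
> caller can abort the cancelled query and release the locks it already holds:
> {code:java}
> // Illustrative sketch only; in the real lock manager the interruption would
> // presumably be translated into its existing error/LockException handling
> // rather than propagated raw as done here.
> public class LockRetryFixSketch {
>
>     // Hypothetical stand-in, as in the previous sketch.
>     static boolean tryCreateLockNode() {
>         return false;
>     }
>
>     static boolean lockWithRetries(int numRetries, long sleepMillis)
>             throws InterruptedException {
>         for (int attempt = 0; attempt <= numRetries; attempt++) {
>             if (tryCreateLockNode()) {
>                 return true;
>             }
>             // No catch-and-continue here: an interrupt from a cancelled or
>             // timed-out query stops the retry loop immediately instead of
>             // being ignored for up to ~100 minutes.
>             Thread.sleep(sleepMillis);
>         }
>         return false;
>     }
> }
> {code}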
--
This message was sent by Atlassian Jira
(v8.3.4#803005)