[ https://issues.apache.org/jira/browse/HIVE-25663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HIVE-25663:
----------------------------------
Labels: pull-request-available (was: )
> Need to modify table/partition lock acquisition retry for Zookeeper lock manager
> --------------------------------------------------------------------------------
>
> Key: HIVE-25663
> URL: https://issues.apache.org/jira/browse/HIVE-25663
> Project: Hive
> Issue Type: Improvement
> Components: Locking
> Reporter: Eugene Chung
> Assignee: Eugene Chung
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2021-10-30-11-54-42-164.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {code:java}
> LOCK TABLE default.my_table PARTITION (log_date='2021-10-30') EXCLUSIVE;
> SET hive.query.timeout.seconds=5;
> SELECT * FROM default.my_table WHERE log_date='2021-10-30' LIMIT 10;
> {code}
> If you execute the three SQLs above in the same session, the last SELECT is
> cancelled with a timeout error. The problem is that, if you are using
> ZooKeeperHiveLockManager, running 'show locks' afterwards shows a SHARED lock
> on default.my_table which remains for 100 minutes.
> !image-2021-10-30-11-54-42-164.png|width=873,height=411!
> Let me explain the problem step by step.
>
> The following SELECT SQL, which reads some data from a partitioned table,
> {code:java}
> SELECT * FROM my_table WHERE log_date='2021-10-30' LIMIT 10{code}
> needs two SHARED locks, acquired in this order:
> * default.my_table
> * default.my_table@log_date=2021-10-30
> Suppose that, before the SQL is executed, an EXCLUSIVE lock on the partition
> already exists. We can easily simulate this with a DDL like the one below:
> {code:java}
> LOCK TABLE default.my_table PARTITION (log_date='2021-10-30') EXCLUSIVE{code}
>
> The SELECT SQL acquires the SHARED lock on the table, but it cannot acquire
> the SHARED lock on the partition. It keeps retrying the acquisition as governed
> by two configurations; with the default values below it retries for 100
> minutes (100 retries x 60s), as the sketch after the list illustrates.
> * hive.lock.sleep.between.retries=60s
> * hive.lock.numretries=100
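>
> A simplified, illustrative sketch of that retry behaviour (this is not the
> actual ZooKeeperHiveLockManager code; the class name and the
> tryCreateLockNode() helper are hypothetical, only the retry/sleep structure
> and the swallowed InterruptedException reflect the issue):
> {code:java}
> // Simplified, illustrative sketch of the lock acquisition retry behaviour.
> // With the defaults, 100 retries x 60s sleep keeps the thread busy for up to
> // ~100 minutes while the conflicting EXCLUSIVE lock is held.
> public class LockRetrySketch {
>
>     // Hypothetical stand-in for creating the lock znode in ZooKeeper;
>     // returns false as long as a conflicting lock exists.
>     static boolean tryCreateLockNode() {
>         return false;
>     }
>
>     static boolean lockWithRetries(int numRetries, long sleepMillis) {
>         for (int attempt = 0; attempt <= numRetries; attempt++) {
>             if (tryCreateLockNode()) {
>                 return true; // lock acquired
>             }
>             try {
>                 Thread.sleep(sleepMillis); // hive.lock.sleep.between.retries
>             } catch (InterruptedException e) {
>                 // The problematic part: the interrupt coming from a cancelled
>                 // or timed-out query is only logged, and the loop goes on.
>                 System.err.println("Interrupted while waiting for the lock, retrying: " + e);
>             }
>         }
>         return false; // gave up after roughly numRetries * sleepMillis
>     }
>
>     public static void main(String[] args) {
>         // hive.lock.numretries=100, hive.lock.sleep.between.retries=60s
>         lockWithRetries(100, 60_000L);
>     }
> }
> {code}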
>
> If hive.query.timeout.seconds is set to 5 seconds, the SELECT SQL is cancelled
> after 5 seconds and the client returns with a timeout error. But the SHARED
> lock on my_table still remains for 100 minutes, because [the current
> ZooKeeperHiveLockManager just logs the
> InterruptedException|https://github.com/apache/hive/blob/8a8e03d02003aa3543f46f595b4425fd8c156ad9/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/zookeeper/ZooKeeperHiveLockManager.java#L326]
> and keeps retrying the lock acquisition. This also means the SQL processing
> thread keeps running for 100 minutes even though the SQL was cancelled. If the
> same SQL is executed 3 times, you can see 3 such threads, each with a thread
> dump like the one below:
> {code:java}
> "HiveServer2-Background-Pool: Thread-154" #154 prio=5 os_prio=0
> tid=0x00007f0ac91cb000 nid=0x13d25 waiting on condition [0x000
> 07f0aa2ce2000]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager.lock(ZooKeeperHiveLockManager.java:303)
> at
> org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager.lock(ZooKeeperHiveLockManager.java:207)
> at
> org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager.acquireLocks(DummyTxnManager.java:199)
> at org.apache.hadoop.hive.ql.Driver.acquireLocks(Driver.java:1610)
> at org.apache.hadoop.hive.ql.Driver.lockAndRespond(Driver.java:1796)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1966)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1710)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1704)
> at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:217)
> at
> org.apache.hive.service.cli.operation.SQLOperation.access$500(SQLOperation.java:87)
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:309)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:322)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
>
> I think ZooKeeperHiveLockManager should not swallow unexpected exceptions such
> as this InterruptedException. It should retry only for the expected ones.
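>
> As a minimal sketch of one possible shape of the change (this is not the
> actual patch; it continues the hypothetical example above), the retry loop
> could let the interrupt stop the acquisition instead of logging it, so the
> caller can abort the cancelled query and release the locks it already holds:
> {code:java}
> // Illustrative sketch only; in the real lock manager the interruption would
> // presumably be translated into its existing error/LockException handling
> // rather than propagated raw as done here.
> public class LockRetryFixSketch {
>
>     // Hypothetical stand-in, as in the previous sketch.
>     static boolean tryCreateLockNode() {
>         return false;
>     }
>
>     static boolean lockWithRetries(int numRetries, long sleepMillis)
>             throws InterruptedException {
>         for (int attempt = 0; attempt <= numRetries; attempt++) {
>             if (tryCreateLockNode()) {
>                 return true;
>             }
>             // No catch-and-continue here: an interrupt from a cancelled or
>             // timed-out query stops the retry loop immediately instead of
>             // being ignored for up to ~100 minutes.
>             Thread.sleep(sleepMillis);
>         }
>         return false;
>     }
> }
> {code}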
--
This message was sent by Atlassian Jira
(v8.3.4#803005)