[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616277#comment-15616277 ]
Thejas M Nair commented on HIVE-14979:
--------------------------------------

The current approach of cleanup on restart relies on the restart happening on the same node. In cloud environments, nodes go down more frequently. In on-prem deployments, a node with a hardware failure could leave that node/IP unavailable for some time. A new HS2 instance might get started on a different node with a different IP address.
Also, the current approach doesn't handle the case of multiple instances of HS2 running on the same host.

I think going with [persistent ephemeral|http://curator.apache.org/curator-recipes/persistent-ephemeral-node.html] nodes is a better approach.
That approach is also not as resilient as I wish it would be: the fact that this Curator recipe exists shows that there is some flakiness around nodes being around when they should be. So I think we should still keep the session.timeout on the order of minutes.

Regarding the session timeout -
Looks like the original setting for the session timeout was 10 mins, and HIVE-9119 changed it to 20 mins.
In the case of ZooKeeper service discovery, it is not a major issue if the entry in ZooKeeper stays around for longer. A larger timeout can provide better resilience against temporary GC or network issues. 10 mins might still be OK for this purpose.
However, in the case of the locks, we want to wait as little as possible before cleanup, so that after an improper shutdown we can clean up the entries sooner. I think we would still want it to be a couple of minutes for the sake of resiliency.
Since the requirements are different, we could create a separate config for the lock ZK session timeout.
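As a rough illustration of the separate-config idea, here is a minimal Java sketch, not taken from any patch on this issue. It assumes the "session.timeout" referred to above is the `hive.zookeeper.session.timeout` property. The key `hive.zookeeper.lock.session.timeout`, the class name `ZkTimeouts`, and the use of plain millisecond values in a `java.util.Properties` object are all hypothetical simplifications. The lock timeout falls back to the general session timeout when it is not set, so existing deployments would keep their current behaviour.

```java
import java.util.Properties;

/** Sketch: resolve separate ZooKeeper session timeouts for discovery and for locks. */
final class ZkTimeouts {
    // Assumed to be the "session.timeout" setting discussed in the comment.
    static final String SESSION_TIMEOUT_KEY = "hive.zookeeper.session.timeout";
    // Hypothetical key, invented for this illustration only.
    static final String LOCK_SESSION_TIMEOUT_KEY = "hive.zookeeper.lock.session.timeout";
    // 20 minutes, the value the comment says HIVE-9119 introduced.
    static final long DEFAULT_SESSION_TIMEOUT_MS = 20 * 60 * 1000L;

    private ZkTimeouts() {}

    /** Timeout used for service discovery; a longer value tolerates GC or network pauses. */
    static long sessionTimeoutMs(Properties conf) {
        return readMillis(conf, SESSION_TIMEOUT_KEY, DEFAULT_SESSION_TIMEOUT_MS);
    }

    /** Timeout used for lock sessions; falls back to the general timeout if unset. */
    static long lockSessionTimeoutMs(Properties conf) {
        return readMillis(conf, LOCK_SESSION_TIMEOUT_KEY, sessionTimeoutMs(conf));
    }

    // Reads a positive millisecond value, or returns the fallback when the key is absent or blank.
    private static long readMillis(Properties conf, String key, long fallbackMs) {
        String raw = conf.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return fallbackMs;
        }
        long value = Long.parseLong(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(key + " must be positive, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(SESSION_TIMEOUT_KEY, "600000");      // 10 min for discovery
        conf.setProperty(LOCK_SESSION_TIMEOUT_KEY, "120000"); // 2 min for locks
        System.out.println("discovery timeout ms: " + sessionTimeoutMs(conf));
        System.out.println("lock timeout ms: " + lockSessionTimeoutMs(conf));
    }
}
```

With this shape, the discovery entries could keep a long timeout while the lock sessions expire within a couple of minutes of an improper shutdown.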

> Removing stale Zookeeper locks at HiveServer2 initialization
> ------------------------------------------------------------
>
>                 Key: HIVE-14979
>                 URL: https://issues.apache.org/jira/browse/HIVE-14979
>             Project: Hive
>          Issue Type: Improvement
>          Components: Locking
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>         Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch,
> HIVE-14979.5.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store tokens that indicate that particular
> tables are locked with the creation of persistent Zookeeper objects.
> A problem can occur when a HiveServer2 instance creates a lock on a table and
> the HiveServer2 instance crashes ("Out of Memory" for example) and the locks
> are not released in Zookeeper. This lock will then remain until it is
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization,
> making the admins' life easier.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)