[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616277#comment-15616277
 ] 

Thejas M Nair commented on HIVE-14979:
--------------------------------------

The current approach of cleanup on restart relies on the restart happening on 
the same node. In cloud environments, nodes go down more frequently. In 
on-prem deployments, a hardware failure could leave that node/IP unavailable 
for some time. In either case, a new HS2 instance might be started on a 
different node with a different IP address. 
Also, the current approach doesn't handle the case of multiple instances of HS2 
running on the same host.
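
To make the two failure modes concrete, here is a minimal Python sketch. It 
assumes a hypothetical scheme in which each lock records only the host name of 
the HS2 instance that created it, and cleanup on restart deletes every lock 
recorded for the local host. The lock paths, host names and the 
cleanup_on_restart function are invented for illustration and are not Hive code.

```python
# Hypothetical model: lock path -> host of the HS2 instance that created it.
# This is not Hive's actual lock layout, only an illustration.

def cleanup_on_restart(locks, local_host):
    """Delete every lock recorded for local_host; return the removed paths."""
    stale = sorted(path for path, host in locks.items() if host == local_host)
    for path in stale:
        del locks[path]
    return stale


# Case 1: the crashed instance ran on node-a, the replacement starts on node-b.
locks = {"/locks/db/t1": "node-a"}
removed_case1 = cleanup_on_restart(locks, "node-b")
leftover_case1 = dict(locks)  # the stale lock on t1 is still there

# Case 2: two instances share node-a; one crashes and restarts.
# t2 belongs to the crashed instance, t3 to the instance that is still running.
locks = {"/locks/db/t2": "node-a", "/locks/db/t3": "node-a"}
removed_case2 = cleanup_on_restart(locks, "node-a")  # also removes the live lock on t3
```

In the first case nothing is cleaned up; in the second, the lock of a healthy 
instance is removed along with the stale one.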

I think going with [persistent 
ephemeral|http://curator.apache.org/curator-recipes/persistent-ephemeral-node.html]
 nodes is a better approach. That approach is also not as resilient as I wish 
it would be: the fact that this Curator recipe exists shows that there is some 
flakiness around nodes being present when they should be. So I think we should 
still keep the session.timeout in the order of minutes.
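
The reason for a timeout in the order of minutes can be sketched with a toy 
model of ephemeral-node behaviour: a node disappears once its owning session 
has gone without a heartbeat for longer than the session timeout. The class 
below is a simplified in-memory stand-in, not the ZooKeeper or Curator API, and 
the timeout and pause lengths are made-up numbers.

```python
class ToySession:
    """In-memory stand-in for a ZooKeeper session holding ephemeral nodes."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat_s = 0.0
        self.nodes = set()
        self.expired = False

    def check(self, now_s):
        # The session expires when no heartbeat arrived within the timeout.
        if now_s - self.last_heartbeat_s > self.timeout_s:
            self.expired = True
            self.nodes.clear()  # ephemeral nodes go away with the session

    def heartbeat(self, now_s):
        # A heartbeat that arrives after expiry cannot revive the session.
        self.check(now_s)
        if not self.expired:
            self.last_heartbeat_s = now_s


def survives_pause(timeout_s, pause_s):
    """Return True if an ephemeral node outlives a client pause of pause_s."""
    session = ToySession(timeout_s)
    session.nodes.add("/locks/db/t1")
    session.heartbeat(pause_s)  # first heartbeat after the pause ends
    return "/locks/db/t1" in session.nodes


short_timeout_ok = survives_pause(timeout_s=10, pause_s=45)   # node is lost
long_timeout_ok = survives_pause(timeout_s=120, pause_s=45)   # node survives
```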


Regarding the session timeout -
It looks like the original setting for the session timeout was 10 mins, and 
HIVE-9119 changed it to 20 mins. 
In the case of ZooKeeper service discovery, it is not a major issue if the 
entry in ZooKeeper stays around for longer. A larger timeout can provide better 
resilience against temporary GC or network issues. 10 mins might still be OK 
for this purpose.

However, in the case of the locks, we want to wait as little as possible before 
cleanup, so that after an improper shutdown the entries are cleaned up sooner. 
I think we would still want it to be a couple of minutes for the sake of 
resiliency. 
Since the requirements are different, we could create a separate config for the 
lock ZK session timeout.
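
A minimal sketch of what such a split could look like, in Python for brevity. 
hive.zookeeper.session.timeout is taken to be the existing setting discussed 
above; hive.lock.zookeeper.session.timeout is a hypothetical name invented 
here, as are the fallback rule and the 2 minute example value.

```python
# Existing key (value in milliseconds); 20 minutes per the discussion above.
SERVICE_TIMEOUT_KEY = "hive.zookeeper.session.timeout"
# Hypothetical key for the lock manager's session; not a real Hive property.
LOCK_TIMEOUT_KEY = "hive.lock.zookeeper.session.timeout"

DEFAULT_SERVICE_TIMEOUT_MS = 20 * 60 * 1000


def session_timeouts(conf):
    """Return (service_discovery_ms, lock_ms) from a dict of settings."""
    service_ms = int(conf.get(SERVICE_TIMEOUT_KEY, DEFAULT_SERVICE_TIMEOUT_MS))
    # Assumed fallback: without a lock-specific value, reuse the shared one,
    # so existing deployments keep their current behaviour.
    lock_ms = int(conf.get(LOCK_TIMEOUT_KEY, service_ms))
    return service_ms, lock_ms


unchanged = session_timeouts({})
split = session_timeouts({LOCK_TIMEOUT_KEY: "120000"})  # 2 minutes for locks
```

With the fallback, service discovery keeps the long timeout while the lock 
session can be tuned down independently.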



> Removing stale Zookeeper locks at HiveServer2 initialization
> ------------------------------------------------------------
>
>                 Key: HIVE-14979
>                 URL: https://issues.apache.org/jira/browse/HIVE-14979
>             Project: Hive
>          Issue Type: Improvement
>          Components: Locking
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>         Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, 
> HIVE-14979.5.patch, HIVE-14979.patch
>
>
> HiveServer2 can use Zookeeper to store tokens indicating that particular 
> tables are locked, by creating persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> then crashes ("Out of Memory" for example), so that the locks are not 
> released in Zookeeper. Such a lock will then remain until it is manually 
> cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> making the admin's life easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
