[
https://issues.apache.org/jira/browse/HIVE-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976624#comment-15976624
]
Chaoyu Tang commented on HIVE-16487:
------------------------------------
[~pvary], your analysis makes sense. Thanks.
> Serious Zookeeper exception is logged when a race condition happens
> -------------------------------------------------------------------
>
> Key: HIVE-16487
> URL: https://issues.apache.org/jira/browse/HIVE-16487
> Project: Hive
> Issue Type: Bug
> Components: Locking
> Affects Versions: 3.0.0
> Reporter: Peter Vary
> Assignee: Peter Vary
>
> A customer started to see this in the logs, but happily everything was
> working as intended:
> {code}
> 2017-03-30 12:01:59,446 ERROR ZooKeeperHiveLockManager:
> [HiveServer2-Background-Pool: Thread-620]: Serious Zookeeper exception:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for /hive_zookeeper_namespace/<TABLE_NAME>/LOCK-SHARED-
> {code}
> This was happening, because a race condition between the lock releasing, and
> lock acquiring. The thread releasing the lock removes the parent ZK node just
> after the thread acquiring the lock made sure, that the parent node exists.
> Since this can happen without any real problem, I plan to add NODEEXISTS, and
> NONODE as a transient ZooKeeper exception, so the users are not confused.
> Also, the original author of ZooKeeperHiveLockManager maybe planned to handle
> different ZooKeeperExceptions differently, and the code is hard to
> understand. See the {{continue}} and the {{break}}. The {{break}} only breaks
> the switch, and not the loop which IMHO is not intuitive:
> {code}
> do {
> try {
> [..]
> ret = lockPrimitive(key, mode, keepAlive, parentCreated,
> } catch (Exception e1) {
> if (e1 instanceof KeeperException) {
> KeeperException e = (KeeperException) e1;
> switch (e.code()) {
> case CONNECTIONLOSS:
> case OPERATIONTIMEOUT:
> LOG.debug("Possibly transient ZooKeeper exception: ", e);
> continue;
> default:
> LOG.error("Serious Zookeeper exception: ", e);
> break;
> }
> }
> [..]
> }
> } while (tryNum < numRetriesForLock);
> {code}
> If we do not want to try again in case of a "Serious Zookeeper exception:",
> then we should add a label to the do loop, and break it in the switch.
> If we do want to try regardless of the type of the ZK exception, then we
> should just change the {{continue;}} to {{break;}} and move the lines part of
> the code which did not run in case of {{continue}} to the {{default}} switch,
> so it is easier to understand the code.
> Any suggestions or ideas [~ctang.ma] or [~szehon]?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)