[
https://issues.apache.org/jira/browse/ACCUMULO-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196721#comment-14196721
]
Josh Elser commented on ACCUMULO-3296:
--------------------------------------
Found it:
{noformat}
Stat stat;
while (true) {
  try {
    stat = getZooKeeper(info).exists(zPath, null);
    // Node exists
    if (stat != null) {
      try {
        // Try to delete it. We don't care if there was an update to the node
        // since we got the Stat, just delete all versions (-1).
        getZooKeeper(info).delete(zPath, -1);
        return;
      } catch (NoNodeException e) {
        // If the node is gone now, it's ok if we have SKIP
        if (policy.equals(NodeMissingPolicy.SKIP)) {
          return;
        }
        throw e;
      }
      // Let other KeeperException bubble to the outer catch
    }
  } catch (KeeperException e) {
    final Code c = e.code();
    if (c == Code.CONNECTIONLOSS || c == Code.OPERATIONTIMEOUT || c == Code.SESSIONEXPIRED) {
      retryOrThrow(retry, e);
    } else {
      throw e;
    }
  }
  retry.waitForNextAttempt();
}
{noformat}
If {{stat}} is null, we'll sleep and retry indefinitely. I guess the tserver
removed itself and the node got cleaned up.
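A minimal sketch of one possible fix (not the actual patch): handle the {{stat == null}} case explicitly, applying the {{NodeMissingPolicy}} instead of falling through to the retry sleep. The {{exists}} stub and class names below are illustrative stand-ins for the ZooKeeper calls in the snippet above.

```java
// Hypothetical sketch only: the real code would call
// getZooKeeper(info).exists(zPath, null) and getZooKeeper(info).delete(zPath, -1).
public class RecursiveDeleteSketch {
  enum NodeMissingPolicy { SKIP, FAIL }

  // Stand-in for the ZooKeeper exists() call; always reports the node as
  // missing, matching the state seen in the failed test.
  static Object exists(String zPath) {
    return null;
  }

  /** Returns true if the node was deleted, or was already gone under SKIP. */
  static boolean delete(String zPath, NodeMissingPolicy policy) {
    while (true) {
      Object stat = exists(zPath);
      if (stat == null) {
        // The missing branch: without this, control falls through to the
        // retry sleep and loops forever even though there is nothing to delete.
        if (policy == NodeMissingPolicy.SKIP) {
          return true;
        }
        throw new IllegalStateException("node does not exist: " + zPath);
      }
      // Real code would delete all versions (-1) here.
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(delete("/tservers/lock", NodeMissingPolicy.SKIP));
  }
}
```

With this branch in place, a node that is already gone terminates the loop immediately (or fails fast under a non-SKIP policy) rather than retrying indefinitely.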
> Infinite ZK retry loop somewhere
> --------------------------------
>
> Key: ACCUMULO-3296
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3296
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Reporter: Josh Elser
> Assignee: Josh Elser
> Fix For: 1.6.2, 1.7.0
>
>
> ShutdownIT-shutdownDuringQuery failed.
> The end of the master log had the following:
> {noformat}
> 2014-11-04 09:47:56,220 [master.LiveTServerSet] INFO : Removing zookeeper lock for tserver:39492[1497a3301100002]
> 2014-11-04 09:47:56,243 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:56,494 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:56,745 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:56,996 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:57,247 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:57,498 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:57,749 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:58,000 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:58,252 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:58,503 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:58,754 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:59,006 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:59,257 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:59,508 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:47:59,759 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:48:00,011 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:48:00,262 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-04 09:48:00,513 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> {noformat}
> The Retry log message kept repeating until the test timed out. Every
> invocation of that sleep should also log the exception that was caught
> and caused us to perform the retry.
> It seems likely that recursiveDelete isn't doing something correctly,
> given that it was the last thing the Master was about to do.
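The logging improvement suggested in the description could be sketched like this. The {{RetrySketch}} class below is a hypothetical stand-in, not Accumulo's actual {{Retry}} implementation; the point is that the sleep message carries the exception that triggered the retry.

```java
import java.util.logging.Logger;

// Hypothetical sketch: a retry helper whose sleep message includes the
// exception that caused the retry, so repeating DEBUG lines are diagnosable.
public class RetrySketch {
  private static final Logger log = Logger.getLogger(RetrySketch.class.getName());
  private final long waitMillis;
  private int attempts = 0;

  RetrySketch(long waitMillis) {
    this.waitMillis = waitMillis;
  }

  // Log the cause alongside the sleep message, then sleep before retrying.
  void waitForNextAttempt(Exception cause) throws InterruptedException {
    attempts++;
    log.fine("Sleeping for " + waitMillis + "ms before retrying operation, caused by: " + cause);
    Thread.sleep(waitMillis);
  }

  int attempts() {
    return attempts;
  }

  public static void main(String[] args) throws InterruptedException {
    RetrySketch retry = new RetrySketch(10);
    retry.waitForNextAttempt(new RuntimeException("KeeperErrorCode = ConnectionLoss"));
    System.out.println(retry.attempts());
  }
}
```

Had the log lines above carried the causing exception, it would have been immediately clear which ZooKeeper operation was looping.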
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)