[ 
https://issues.apache.org/jira/browse/ACCUMULO-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012884#comment-15012884
 ] 

Josh Elser commented on ACCUMULO-4060:
--------------------------------------

Looked into this a little bit more. I think what happened is that there was a 
rolling-restart of the ZooKeeper servers while Accumulo was still running. This 
was enough for the repo-runner threads to get a ConnectionLoss (more on this 
later), but the Master itself reconnected without losing its session (as we 
expect -- yay!).

Now, I believe the reason that the RepoRunner threads eventually tanked is 
this block in the {{reserve()}} method of {{ZooStore.java}}.

{code}
try {
  // ...
} catch (Exception e) {
  throw new RuntimeException(e);
}
{code}

Our ZooKeeper code (notably org.apache.accumulo.fate.zookeeper.ZooReader) does 
implicitly wrap these transient ZK connection issues and retry them. By default, 
we retry 10 times and then propagate the exception. The RepoRunner threads have 
no higher-level retry built into them, so the thread dies when it gets the 
RuntimeException (RTE).
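
To make the missing piece concrete, here is a minimal sketch of what a higher-level retry around a store call could look like. This is not the actual Accumulo code or the eventual fix: {{RetryingRunnerSketch}}, {{TransientStoreException}} (a stand-in for something like ConnectionLoss), {{callWithRetry}} and the backoff values are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.Callable;

class RetryingRunnerSketch {

  /** Hypothetical stand-in for a transient ZooKeeper error such as ConnectionLoss. */
  static class TransientStoreException extends Exception {
    TransientStoreException(String msg) {
      super(msg);
    }
  }

  /**
   * Runs op, retrying only on TransientStoreException. Makes at most maxAttempts
   * attempts in total and sleeps between attempts, doubling the delay each time.
   * Any other exception is propagated immediately.
   */
  static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialBackoffMs)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long backoffMs = initialBackoffMs;
    for (int attempt = 1;; attempt++) {
      try {
        return op.call();
      } catch (TransientStoreException e) {
        if (attempt >= maxAttempts) {
          throw e; // out of attempts: let the caller decide what to do
        }
        Thread.sleep(backoffMs);
        backoffMs *= 2;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    final int[] calls = {0};
    // Simulated store call that fails twice before succeeding.
    String result = callWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new TransientStoreException("ConnectionLoss (simulated)");
      }
      return "reserved";
    }, 5, 10);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

The point of the sketch is only that the transient failure is handled by the caller of the store operation rather than being wrapped in a RuntimeException that ends the thread; whether the retry belongs in the runner or in the store itself is a separate design question.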

> Transient ZooKeeper connection issues kills FATE Runner threads
> ---------------------------------------------------------------
>
>                 Key: ACCUMULO-4060
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.1, 1.8.0
>
>
> Noticed the following on a 6-node Accumulo cluster with Kerberos and 
> quality of protection set to auth-conf (wire encryption). The cluster 
> appeared to be up and running -- healthy. An attempt to create a table via the 
> shell hung in the CreateTableCommand, polling on the FATE operation. 
> After a few minutes, it had made no progress.
> Inspecting the FATE transactions showed that there were (multiple) FATE ops 
> running, but none were locked or locking any tables, nor making any progress.
> This led me to inspect the Master's log to figure out why it wasn't making 
> any progress, and, to my joy, I found the following:
> {noformat}
> 2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> {noformat}
> This happened at the end of a ~30s period of difficulties in the Master 
> communicating with ZooKeeper. I've yet to investigate why this pause 
> happened, but the fact that the FATE runner threads died and the Master kept 
> running is no good.


