[
https://issues.apache.org/jira/browse/ACCUMULO-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012720#comment-15012720
]
Josh Elser commented on ACCUMULO-4060:
--------------------------------------
I can see some options here to fix this:
1. Tank the JVM if a repo-runner thread dies. This is quick and dirty (it is
often how we react to ZooKeeper issues) and could cause operator headaches,
since the Master would have to be restarted.
2. Re-launch repo-runner threads to always reach the desired number of threads.
This will help keep the master running in the transient-connection-issue cases,
but it could mask a bigger issue that does require human intervention (then
again, is skipping the logs and expecting the process to die really a good
excuse?)
3. Something else entirely?
[~kturner], [~ecn], [~ctubbsii], your opinions would be appreciated.
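
To make options 1 and 2 concrete, here is a minimal sketch of how they could be combined. It is not the actual Fate code: the class name ResilientRunner, its parameters, and the onFatal callback are all hypothetical. Instead of re-launching a dead thread, the loop catches the failure so the thread never dies, which keeps the runner count constant across a transient outage. Once failures look persistent, it escalates to whatever fatal action the caller supplies (for option 1, that could be halting the JVM).

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical wrapper, not Accumulo code: keeps a worker loop alive across
 * transient failures and escalates once failures look persistent.
 */
public class ResilientRunner implements Runnable {
  private final BooleanSupplier step;        // one unit of work; returns false to stop cleanly
  private final int maxConsecutiveFailures;  // failure streak that triggers escalation
  private final long backoffMillis;          // pause between retries
  private final Runnable onFatal;            // e.g. () -> Runtime.getRuntime().halt(1)
  private final AtomicInteger totalFailures = new AtomicInteger();

  public ResilientRunner(BooleanSupplier step, int maxConsecutiveFailures,
      long backoffMillis, Runnable onFatal) {
    this.step = step;
    this.maxConsecutiveFailures = maxConsecutiveFailures;
    this.backoffMillis = backoffMillis;
    this.onFatal = onFatal;
  }

  @Override
  public void run() {
    int consecutive = 0;
    while (!Thread.currentThread().isInterrupted()) {
      try {
        if (!step.getAsBoolean()) {
          return; // the work source asked us to stop
        }
        consecutive = 0; // a success resets the failure streak
      } catch (RuntimeException e) {
        totalFailures.incrementAndGet();
        consecutive++;
        if (consecutive >= maxConsecutiveFailures) {
          onFatal.run(); // option 1: give up loudly instead of dying silently
          return;
        }
        try {
          Thread.sleep(backoffMillis); // option 2: wait, then try again
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // preserve the interrupt and exit
          return;
        }
      }
    }
  }

  /** Total number of failed steps seen so far, useful for metrics or alerts. */
  public int totalFailures() {
    return totalFailures.get();
  }
}
```

The failure counter is exposed so that retries stay visible to operators, which is one way to address the concern that option 2 hides a bigger problem. The threshold and backoff values would need tuning against real ZooKeeper session timeouts.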
> Transient ZooKeeper connection issues kills FATE Runner threads
> ---------------------------------------------------------------
>
> Key: ACCUMULO-4060
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
> Project: Accumulo
> Issue Type: Bug
> Components: fate, master
> Reporter: Josh Elser
> Assignee: Josh Elser
> Fix For: 1.7.1, 1.8.0
>
>
> Noticed the following on a 6-node Accumulo cluster with Kerberos and
> quality of protection set to auth-conf (wire encryption). The cluster
> appeared to be up and running -- healthy. An attempt to create a table via
> the shell hung in the CreateTableCommand, polling on the FATE operation.
> After a few minutes, it had still made no progress.
> Inspecting the FATE transactions showed that there were (multiple) FATE ops
> running, but none were locked or locking any tables, nor making any progress.
> This led me to inspect the Master's log to figure out why it wasn't making
> any progress, and, to my joy, I found the following:
> {noformat}
> 2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
> at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
> at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
> ... 6 more
> 2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
> at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
> at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
> ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
> at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
> at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
> ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
> at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
> at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
> at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
> at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
> ... 6 more
> {noformat}
> This happened at the end of a ~30s period of difficulties in the Master
> communicating with ZooKeeper. I've yet to investigate why this pause
> happened, but the fact that the FATE runner threads died and the Master kept
> running is no good.