Josh Elser created ACCUMULO-4060:
------------------------------------
Summary: Transient ZooKeeper connection issues kills FATE Runner threads
Key: ACCUMULO-4060
URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
Project: Accumulo
Issue Type: Bug
Components: fate, master
Reporter: Josh Elser
Assignee: Josh Elser
Fix For: 1.7.1, 1.8.0
Noticed the following on a 6-node Accumulo cluster with Kerberos and
quality of protection set to auth-conf (wire encryption). The cluster appeared
to be up and running -- healthy. An attempt to create a table via the shell
hung in the CreateTableCommand, polling on the FATE operation. After a few
minutes, it had made no progress.
Inspecting the FATE transactions showed that there were (multiple) FATE ops
running, but none were locked or locking any tables, nor making any progress.
This led me to inspect the Master's log to figure out why it wasn't making any
progress, and, to my joy, I found the following:
{noformat}
2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
	at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
	at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
	at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
	at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
	at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
	... 6 more
2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
	at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
	at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
	at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
	at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
	at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
	... 6 more
2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
	at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
	at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
	at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
	at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
	at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
	... 6 more
2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
	at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
	at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
	at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
	at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
	at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
	... 6 more
{noformat}
This happened at the end of a ~30s period during which the Master had
difficulty communicating with ZooKeeper. I've yet to investigate why this pause
happened, but the fact that the FATE runner threads died while the Master kept
running is no good: with no runners left, FATE operations never make progress.
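
One way to keep a runner alive through a blip like this might be to retry the failing call with backoff instead of letting the exception escape the run loop. The following is only a sketch of that idea, not the fix applied for this issue: RetryingReserve, TransientStoreException, and callWithRetry are hypothetical names, and TransientStoreException stands in for ZooKeeper's ConnectionLossException so that the example has no dependencies.

```java
import java.util.concurrent.Callable;

public class RetryingReserve {

  /** Hypothetical stand-in for a transient failure such as a ZooKeeper connection loss. */
  public static class TransientStoreException extends Exception {
    public TransientStoreException(String msg) {
      super(msg);
    }
  }

  /**
   * Calls op, retrying with exponential backoff while it throws TransientStoreException.
   * Rethrows the last failure once maxAttempts is exhausted; other exceptions propagate
   * immediately.
   */
  public static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialBackoffMs)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long backoff = initialBackoffMs;
    for (int attempt = 1;; attempt++) {
      try {
        return op.call();
      } catch (TransientStoreException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up: the caller decides whether to halt or keep going
        }
        Thread.sleep(backoff);
        backoff = Math.min(backoff * 2, 5_000L); // cap the wait at 5s (arbitrary choice)
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulate a store that fails twice with a connection loss and then recovers.
    String txid = callWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new TransientStoreException("ConnectionLoss (simulated)");
      }
      return "tx-1";
    }, 5, 10L);
    System.out.println(txid + " reserved after " + calls[0] + " attempts");
    // prints "tx-1 reserved after 3 attempts"
  }
}
```

Whether a runner should retry forever, or give up after some number of attempts and take the Master down with it, is a separate design question. Either seems preferable to a Master that stays up with no runner threads.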