[
https://issues.apache.org/jira/browse/ACCUMULO-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014670#comment-15014670
]
ASF GitHub Bot commented on ACCUMULO-4060:
------------------------------------------
Github user joshelser commented on the pull request:
https://github.com/apache/accumulo/pull/52#issuecomment-158227531
> Seems like it would be simpler to modify transaction runner and add a
try/catch/log just inside the while loop.
We could very well do this instead. I was hoping to pick your brain on any
concerns about simply swallowing those exceptions. I suppose in the end it's
no different.
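For illustration, a minimal sketch of the try/catch/log-inside-the-loop idea might look like the following. This is not the actual `Fate.TransactionRunner` code: `ResilientRunner`, `unitOfWork`, `keepRunning`, `pauseMillis`, and `failures` are hypothetical names, and the real loop reserves and executes FATE transactions rather than calling a plain `Runnable`.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a runner loop; not Accumulo's actual implementation. */
class ResilientRunner implements Runnable {
  private final Runnable unitOfWork;       // assumed: one reserve-and-execute iteration
  private final AtomicBoolean keepRunning; // assumed: shutdown flag owned by the caller
  private final long pauseMillis;          // assumed: small pause before retrying
  final AtomicInteger failures = new AtomicInteger();

  ResilientRunner(Runnable unitOfWork, AtomicBoolean keepRunning, long pauseMillis) {
    this.unitOfWork = unitOfWork;
    this.keepRunning = keepRunning;
    this.pauseMillis = pauseMillis;
  }

  @Override
  public void run() {
    while (keepRunning.get()) {
      try {
        unitOfWork.run();
      } catch (RuntimeException e) {
        // Log and keep looping so a transient failure does not end the thread.
        failures.incrementAndGet();
        System.err.println("Runner iteration failed, will retry: " + e);
        try {
          Thread.sleep(pauseMillis);
        } catch (InterruptedException ie) {
          // Preserve the interrupt and stop; someone asked this thread to exit.
          Thread.currentThread().interrupt();
          return;
        }
      }
    }
  }
}
```

The sketch catches `RuntimeException` rather than `Throwable`, so an `Error` such as `OutOfMemoryError` would still end the thread. Whether that is the right boundary is exactly the kind of worry about eating exceptions raised above.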
> Transient ZooKeeper connection issues kills FATE Runner threads
> ---------------------------------------------------------------
>
> Key: ACCUMULO-4060
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
> Project: Accumulo
> Issue Type: Bug
> Components: fate, master
> Reporter: Josh Elser
> Assignee: Josh Elser
> Fix For: 1.7.1, 1.8.0
>
>
> Noticed the following on a 6-node Accumulo cluster with Kerberos and
> quality of protection set to auth-conf (wire encryption). The cluster
> appeared to be up and running -- healthy. An attempt to create a table via
> the shell hung in the CreateTableCommand, polling on the FATE operation.
> After a few minutes, it had made no progress.
> Inspecting the FATE transactions showed that there were (multiple) FATE ops
> running, but none were locked or locking any tables, nor making any progress.
> This led me to inspect the Master's log to figure out why it wasn't making
> any progress, and, to my joy, I found the following:
> {noformat}
> 2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>     at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>     at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>     ... 6 more
> 2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>     at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>     at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>     ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>     at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>     at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>     ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>     at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>     at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>     at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>     ... 6 more
> {noformat}
> This happened at the end of a ~30s period during which the Master had
> difficulty communicating with ZooKeeper. I've yet to investigate why this
> pause happened, but the fact that the FATE runner threads died while the
> Master kept running is no good.