[ 
https://issues.apache.org/jira/browse/ACCUMULO-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012720#comment-15012720
 ] 

Josh Elser commented on ACCUMULO-4060:
--------------------------------------

I can see some options here to fix this:

1. Tank the JVM if a repo-runner thread dies. This is quick and dirty (and 
often how we react to ZooKeeper issues), but it could cause operator 
headaches, since the Master would have to be restarted.
2. Re-launch repo-runner threads so that the desired number of threads is 
always running. This would keep the Master running through transient 
connection issues, but it could mask a bigger problem that does require human 
intervention (then again, is not reading the logs and expecting the process to 
die really a good excuse?).
3. Something else entirely?
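
Options 1 and 2 could also be combined: keep a runner alive through a few failures, and only escalate when the failures keep coming. A minimal sketch of that idea follows. It is not Accumulo's actual Fate code; `ResilientRunner`, `step`, `onGiveUp`, and the threshold and backoff parameters are hypothetical names chosen for illustration. It assumes failures surface as a `RuntimeException`, as in the stack traces quoted below.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch only; none of these names come from Accumulo. */
class ResilientRunner implements Runnable {
  private final Runnable step;               // one pass of the runner's work
  private final int maxConsecutiveFailures;  // escalate once this many passes fail in a row
  private final long backoffMillis;          // pause between retries
  private final Runnable onGiveUp;           // escalation hook, e.g. halting the JVM
  private final AtomicInteger totalFailures = new AtomicInteger();
  private volatile boolean keepRunning = true;

  ResilientRunner(Runnable step, int maxConsecutiveFailures, long backoffMillis,
      Runnable onGiveUp) {
    this.step = step;
    this.maxConsecutiveFailures = maxConsecutiveFailures;
    this.backoffMillis = backoffMillis;
    this.onGiveUp = onGiveUp;
  }

  void stop() {
    keepRunning = false;
  }

  int totalFailures() {
    return totalFailures.get();
  }

  @Override
  public void run() {
    int consecutive = 0;
    while (keepRunning && !Thread.currentThread().isInterrupted()) {
      try {
        step.run();
        consecutive = 0; // progress was made, so forget earlier failures
      } catch (RuntimeException e) {
        totalFailures.incrementAndGet();
        System.err.println("runner step failed: " + e);
        if (++consecutive >= maxConsecutiveFailures) {
          onGiveUp.run(); // persistent failure: hand off to the escalation hook
          return;
        }
        try {
          Thread.sleep(backoffMillis);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // preserve the interrupt and exit
          return;
        }
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger calls = new AtomicInteger();
    ResilientRunner[] self = new ResilientRunner[1];
    // Simulate two transient failures followed by a successful pass.
    self[0] = new ResilientRunner(() -> {
      if (calls.incrementAndGet() <= 2) {
        throw new RuntimeException("simulated ConnectionLoss");
      }
      self[0].stop();
    }, 5, 10L, () -> System.err.println("giving up; a real process might halt here"));
    Thread t = new Thread(self[0], "Repo runner 0");
    t.start();
    t.join();
    System.out.println("failures survived: " + self[0].totalFailures());
  }
}
```

Resetting the counter after every successful pass means only an unbroken run of failures triggers `onGiveUp`, which is where the option 1 behavior (tanking the JVM) would plug in. Whether a retry is safe at the point where the real runner fails is a separate question that this sketch does not answer.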

[~kturner], [~ecn], [~ctubbsii], your opinions would be appreciated.

> Transient ZooKeeper connection issues kill FATE Runner threads
> --------------------------------------------------------------
>
>                 Key: ACCUMULO-4060
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4060
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.1, 1.8.0
>
>
> Noticed the following on a 6 node Accumulo cluster with Kerberos and 
> quality of protection set to auth-conf (wire encryption). The cluster 
> appeared to be up and running -- healthy. An attempt to create a table via 
> the shell hung in the CreateTableCommand, polling on the FATE operation. 
> After a few minutes, it had made no progress.
> Inspecting the FATE transactions showed that there were (multiple) FATE ops 
> running, but none were locked or locking any tables, nor making any progress.
> This led me to inspect the Master's log to figure out why it wasn't making 
> any progress, and, to my joy, I found the following:
> {noformat}
> 2015-11-18 23:18:30,784 [fate.Fate] ERROR: Thread "Repo runner 0" died 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,783 [fate.Fate] ERROR: Thread "Repo runner 2" died 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 1" died 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> 2015-11-18 23:18:30,787 [fate.Fate] ERROR: Thread "Repo runner 3" died 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:189)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:158)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:60)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /accumulo/a1af6ffa-720b-4ec3-8198-5891010294a5/fate
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at 
> org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:151)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:128)
>         ... 6 more
> {noformat}
> This happened at the end of a ~30s period in which the Master had difficulty 
> communicating with ZooKeeper. I've yet to investigate why this pause 
> happened, but the fact that the FATE runner threads died and the Master kept 
> running is no good.


