[ https://issues.apache.org/jira/browse/ACCUMULO-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Elser resolved ACCUMULO-3336.
----------------------------------
       Resolution: Cannot Reproduce
    Fix Version/s:     (was: 1.8.1)

This issue is getting far too much traction from users who think that this is a bug. The issue I reported here is not a simple SESSION_EXPIRED from ZooKeeper. Unless you really understand what's going on, you need to tune the configuration of Accumulo, HDFS or the operating system. This is not a "bug".

> ZK session reconnect still results in loss of ZK lock
> -----------------------------------------------------
>
>                 Key: ACCUMULO-3336
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3336
>             Project: Accumulo
>          Issue Type: Bug
>          Components: zookeeper
>    Affects Versions: 1.5.2, 1.6.1
>            Reporter: Josh Elser
>
> Saw the following
> {noformat}
> 2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>     at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
>     at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
>     at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
>     at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
>     at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
>     at org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
>     at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
>     at java.util.TimerThread.mainLoop(Timer.java:555)
>     at java.util.TimerThread.run(Timer.java:505)
> 2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None Disconnected
> 2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>     at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
>     at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>     at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
>     at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
>     at org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
>     at org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
>     at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
>     at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
>     at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
>     at org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
>     at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
>     at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>     at java.lang.Thread.run(Thread.java:745)
> 2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) secs ConcurrentMarkSweep=0.05(+0.00) secs freemem=118,013,904(+6,412,200) totalmem=129,761,280
> 2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not called in a timely fashion. Expected every 5.0 seconds but was 43.1 seconds since last check
> 2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.13.177:35935 !0 1 entries in 0.03 secs, nbTimes = [24 24 24.00 1]
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in zookeeper: /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state of current session : Expired
> 2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
> 2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting.
> {noformat}
> ZooKeeper code appears to have disconnected, closed the disconnected connection and then opened a new session. However, the ZooLock, IIRC, didn't reconnect and hung the tserver.
> If we want to support this, it might require rehashing some of the ZooLock code (to prevent the tserver from processing while the tserver doesn't have its lock).
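
For anyone triaging a similar SESSION_EXPIRED exit, the timing in the quoted log can be checked with a few lines of arithmetic. This is only a sketch: the class and method names are hypothetical, and nothing here is Accumulo or ZooKeeper API. The two inputs are the values the log reports ("timeout 30000" and "43.1 seconds since last check"), and it assumes the requested timeout is the one the ZooKeeper server actually granted, which may not hold on every cluster.

```java
import java.time.Duration;

public class SessionPauseCheck {

    /**
     * Hypothetical helper: true when the process was stalled for longer
     * than the ZooKeeper session timeout, so an expiry would be expected.
     */
    public static boolean pauseExceedsTimeout(Duration pause, Duration sessionTimeout) {
        return pause.compareTo(sessionTimeout) > 0;
    }

    public static void main(String[] args) {
        // "Connecting to localhost:12644 with timeout 30000" (milliseconds)
        Duration sessionTimeout = Duration.ofMillis(30_000);
        // "Expected every 5.0 seconds but was 43.1 seconds since last check"
        Duration pause = Duration.ofMillis(43_100);

        // 43.1 s is longer than 30 s, so this prints "true".
        System.out.println(pauseExceedsTimeout(pause, sessionTimeout));
    }
}
```

When the check is true, the stall itself (GC, swapping, or a slow disk) is the first thing to look at, which is where the tuning advice in the resolution comment applies.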