[ 
https://issues.apache.org/jira/browse/ACCUMULO-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140704#comment-14140704
 ] 

Josh Elser edited comment on ACCUMULO-3148 at 9/19/14 3:12 PM:
---------------------------------------------------------------

On the contrary, I don't see anything in the master log which indicates that 
the master killed it. The log message below is triggered after a Watcher fires 
on the znode for this tserver. The znode's data is empty, so the master moves 
the tserver into the dead tservers set.

{noformat}
2014-09-15 09:40:16,024 [master.Master] WARN : Lost servers 
[ip-172-31-33-94:40793[14878ae7b920006]]
2014-09-15 09:40:16,024 [master.EventCoordinator] INFO : There are now 0 tablet 
servers
{noformat}
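
To make that sequence concrete, here is a rough toy model of the behaviour described above. It is not the actual master code: the class and method names (LiveServerTracker, on_lock_node_changed) are made up for illustration, and the only things taken from the log are the server name and the two message texts.

```python
class LiveServerTracker:
    """Toy model of a master reacting to a watcher event on a tserver's znode.

    All names here are hypothetical; this is not Accumulo's implementation.
    """

    def __init__(self):
        self.live = set()
        self.dead = set()

    def add(self, server):
        self.live.add(server)

    def on_lock_node_changed(self, server, lock_data):
        """Called when the (hypothetical) watcher fires for `server`.

        Empty lock data means nothing holds the lock any more, so the server
        is moved to the dead set. Nothing here kills the tserver process;
        the master only updates its own bookkeeping and logs.
        """
        if lock_data or server not in self.live:
            return []
        self.live.discard(server)
        self.dead.add(server)
        return [
            "Lost servers [%s]" % server,
            "There are now %d tablet servers" % len(self.live),
        ]


tracker = LiveServerTracker()
tracker.add("ip-172-31-33-94:40793[14878ae7b920006]")
messages = tracker.on_lock_node_changed(
    "ip-172-31-33-94:40793[14878ae7b920006]", b""
)
for line in messages:
    print(line)
```

The point of the sketch is only that the "Lost servers" message is a reaction to the lock data disappearing, not evidence of the master taking action against the tserver.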

The above happens in the middle of the tserver "Sleeping" block. When it wakes 
up, it notices that it lost its lock.

{noformat}
2014-09-15 09:40:20,088 [tserver.TabletServer] FATAL: Lost tablet server lock 
(reason = LOCK_DELETED), exiting.
{noformat}
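
For reference, the gap between the master's "Lost servers" line and the tserver's FATAL line can be checked directly from the timestamps. This is just arithmetic on the log lines quoted in this issue; the variable names are mine.

```python
from datetime import datetime

# Timestamp layout used in the quoted logs, e.g. "2014-09-15 09:40:16,024"
FMT = "%Y-%m-%d %H:%M:%S,%f"


def parse(ts):
    return datetime.strptime(ts, FMT)


# Last tserver line before the "sleeping" block (from the issue description)
last_scan = parse("2014-09-15 09:39:59,201")
# Master logs "Lost servers"
master_lost = parse("2014-09-15 09:40:16,024")
# Tserver logs "Lost tablet server lock (reason = LOCK_DELETED), exiting."
tserver_exit = parse("2014-09-15 09:40:20,088")

# Seconds between the master noticing and the tserver exiting
gap_after_master = (tserver_exit - master_lost).total_seconds()
# Seconds between the last tserver activity and the tserver exiting
gap_since_scan = (tserver_exit - last_scan).total_seconds()

print("master noticed -> tserver exit: %.3f s" % gap_after_master)
print("last scan -> tserver exit: %.3f s" % gap_since_scan)
```

So the master recorded the lost server roughly four seconds before the tserver logged that its lock was gone.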

The master did log a few SocketTimeoutExceptions, but I don't see any 
indication that it actively killed the server; rather, it died on its own 
(unless our logging is insufficient in what you're referencing).


> TabletServer didn't get Session expired in HalfDeadTServerIT
> ------------------------------------------------------------
>
>                 Key: ACCUMULO-3148
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3148
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.6.1, 1.7.0
>
>
> Been seeing spurious failures with HalfDeadTServerIT where it doesn't get 
> the ZK session expiration:
> {noformat}
> 2014-09-15 09:39:59,201 [tserver.TabletServer] DEBUG: ScanSess tid 
> 172.31.33.94:35957 !0 0 entries in 0.07 secs, nbTimes = [63 63 63.00 1] 
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> sleeping
> 2014-09-15 09:40:20,088 [tserver.TabletServer] FATAL: Lost tablet server lock 
> (reason = LOCK_DELETED), exiting.
> 2014-09-15 09:40:20,088 [zookeeper.ZooCache] WARN : Zookeeper error, will 
> retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for 
> /accumulo/d0b9b8e7-9869-4b00-9ae7-317f5231f2c1/tables/1/conf/table.iterator.minc.vers.opt.maxVersions
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>       at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:261)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:153)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:277)
>       at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:224)
>       at 
> org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:114)
>       at 
> org.apache.accumulo.server.conf.ZooCachePropertyAccessor.getProperties(ZooCachePropertyAccessor.java:144)
>       at 
> org.apache.accumulo.server.conf.TableConfiguration.getProperties(TableConfiguration.java:108)
>       at 
> org.apache.accumulo.core.conf.AccumuloConfiguration.iterator(AccumuloConfiguration.java:69)
>       at 
> org.apache.accumulo.core.conf.ConfigSanityCheck.validate(ConfigSanityCheck.java:40)
>       at 
> org.apache.accumulo.server.conf.ServerConfigurationFactory.getTableConfiguration(ServerConfigurationFactory.java:155)
>       at 
> org.apache.accumulo.server.conf.ServerConfiguration.getTableConfiguration(ServerConfiguration.java:69)
>       at 
> org.apache.accumulo.tserver.TabletServer.getTableConfiguration(TabletServer.java:3983)
>       at org.apache.accumulo.tserver.Tablet.<init>(Tablet.java:1277)
>       at org.apache.accumulo.tserver.Tablet.<init>(Tablet.java:1256)
>       at org.apache.accumulo.tserver.Tablet.<init>(Tablet.java:1112)
>       at org.apache.accumulo.tserver.Tablet.<init>(Tablet.java:1089)
>       at 
> org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2935)
>       at 
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>       at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>       at 
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>       at java.lang.Thread.run(Thread.java:745)
> 2014-09-15 09:40:20,090 [tserver.TabletServer] WARN : Check for long GC 
> pauses not called in a timely fashion. Expected every 5.0 seconds but was 
> 16.3 seconds since last check
> 2014-09-15 09:40:20,477 [datanode.DataNode] ERROR: 
> 127.0.0.1:57185:DataXceiver error processing WRITE_BLOCK operation  src: 
> /127.0.0.1:42146 dst: /127.0.0.1:57185
> java.io.IOException: Premature EOF from inputStream
>       at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:771)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:718)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:126)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:72)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}
> It looks like the tserver killed itself after the connection loss but before 
> the tserver retried to connect and got the session expiration.



