Eric Newton created ACCUMULO-1740:
-------------------------------------

             Summary: intermittent integration test failure
                 Key: ACCUMULO-1740
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1740
             Project: Accumulo
          Issue Type: Bug
          Components: test
            Reporter: Eric Newton
            Assignee: Eric Newton


Some of the recovery integration tests fail intermittently, and only after 
hitting a very long timeout (10 minutes).

After a restart of the tablet servers, the WAL is sorted, and the root tablet 
is assigned.  After that, the master does not assign the !METADATA tablets.

I've managed to jstack the master, and it seems to be stuck scanning.  I turned 
on DEBUG log messages and I see this:
{noformat}
2013-09-25 17:27:46,340 [impl.TabletServerBatchReaderIterator] DEBUG: Server : rd6ul-14706v.tycho.ncsc.mil:37957 msg : java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
2013-09-25 17:27:46,340 [impl.TabletServerBatchReaderIterator] DEBUG: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:705)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:364)
        at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.readAll(ThriftTransportPool.java:254)
        at org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
        at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:310)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:290)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:650)
        ... 7 more
Caused by: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        ... 18 more
{noformat}
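
The trace above has the shape of a client blocked on a read from a peer that is connected but never replies, until the socket timeout fires. The sketch below reproduces only that general pattern, using plain java.net sockets as a stand-in. It is not Accumulo, Thrift, or Hadoop code. The class name SilentServerDemo, the method readTimesOut, and the 500 ms timeout are made up for illustration, and the exception message will differ from the Hadoop one in the log.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of Accumulo.
class SilentServerDemo {

    /**
     * Connects to a local listener that never writes any data, then attempts
     * a blocking read with the given timeout. Returns true if the read ended
     * with a SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Port 0 asks the OS for any free port. accept() is never called;
        // the connection is completed by the OS and sits in the listen backlog.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks: the peer never sends a byte
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        boolean timedOut = readTimesOut(500);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("timed out: " + timedOut + " after ~" + elapsedMillis + " ms");
    }
}
```

With a 120000 ms timeout, as in the log, a caller that retries after each such exception would sit for two minutes per attempt, which would look like a thread stuck scanning in a jstack dump.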

The tablet server does put the root tablet online.

There are 8 tests that restart tablet servers, and this usually happens to only 
one of them per run, which makes it difficult to track down.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
