Eric Newton created ACCUMULO-1740:
-------------------------------------
Summary: intermittent integration test failure
Key: ACCUMULO-1740
URL: https://issues.apache.org/jira/browse/ACCUMULO-1740
Project: Accumulo
Issue Type: Bug
Components: test
Reporter: Eric Newton
Assignee: Eric Newton
Some of the recovery integration tests fail with a very long timeout (10
minutes).
After a restart of the tablet servers, the WAL is sorted, and the root tablet
is assigned. After that, the master does not assign the !METADATA tablets.
I've managed to jstack the master, and it seems to be stuck scanning. I turned
on DEBUG log messages and I see this:
{noformat}
2013-09-25 17:27:46,340 [impl.TabletServerBatchReaderIterator] DEBUG: Server : rd6ul-14706v.tycho.ncsc.mil:37957 msg : java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
2013-09-25 17:27:46,340 [impl.TabletServerBatchReaderIterator] DEBUG: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:705)
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:364)
    at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
    at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
    at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.readAll(ThriftTransportPool.java:254)
    at org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
    at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:310)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:290)
    at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:650)
    ... 7 more
Caused by: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
    ... 18 more
{noformat}
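For readers unfamiliar with this symptom, here is a minimal sketch of what the trace shows: a client with an established connection blocks on a read because the peer never replies, and the read eventually times out. This is not Accumulo or Hadoop code. The class name ReadTimeoutSketch and the method readTimesOut are hypothetical, and it uses plain java.net sockets with setSoTimeout rather than the Hadoop SocketIOWithTimeout path seen in the log.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; none of these names come from Accumulo.
public class ReadTimeoutSketch {

    /**
     * Connects to a local server that accepts the connection but never writes
     * a reply, and reports whether the client's blocking read timed out.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Port 0 asks the OS for any free port; bind to loopback only.
        try (ServerSocket silentServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client =
                 new Socket(silentServer.getInetAddress(), silentServer.getLocalPort());
             Socket peer = silentServer.accept()) {
            // Bound the time a read may block, in milliseconds.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks: the peer is connected but never writes
                return false;
            } catch (SocketTimeoutException e) {
                return true; // same exception type as in the log above
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The connection itself succeeds here, just as the log reports a connected SocketChannel; only the response never arrives.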
The tablet server does put the root tablet online.
Eight tests restart tablet servers. The failure usually hits only one of them
per run, which makes it difficult to track down.
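Because the hang is intermittent, it may help to capture the stuck threads automatically instead of running jstack by hand. The following is a rough sketch using the standard ThreadMXBean API. It only sees threads in the JVM it runs in, so it would have to be called from inside the master process (or a test harness sharing its JVM); that placement is an assumption. The class name StuckThreadFinder and the method threadsInClass are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Accumulo.
public class StuckThreadFinder {

    /**
     * Returns the names of live threads in this JVM whose stack contains at
     * least one frame from a class whose name starts with the given prefix.
     */
    static List<String> threadsInClass(String classNamePrefix) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> names = new ArrayList<>();
        // false, false: only stack traces are needed, not lock details.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip threads that vanished mid-dump
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().startsWith(classNamePrefix)) {
                    names.add(info.getThreadName());
                    break; // one match per thread is enough
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // The calling thread is inside this class, so it finds itself.
        System.out.println(threadsInClass(StuckThreadFinder.class.getName()));
    }
}
```

Passing a prefix such as org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator would also match its inner QueryTask class, which appears in the trace above.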