[
https://issues.apache.org/jira/browse/ACCUMULO-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778966#comment-13778966
]
Eric Newton commented on ACCUMULO-1740:
---------------------------------------
Jstack of the tablet servers shows scans in progress:
{noformat}
"ClientPool 17" daemon prio=10 tid=0x00007f9bb8002000 nid=0x37af in
Object.wait() [0x00007f9b9fcfa000]
java.lang.Thread.State: RUNNABLE
at
org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:1329)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.accumulo.trace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:63)
at $Proxy10.startMultiScan(Unknown Source)
at
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2248)
at
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2232)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at
org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:161)
at
org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:478)
at
org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:213)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at
org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
at
org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
at java.lang.Thread.run(Thread.java:662)
{noformat}
... many threads ...
{noformat}
"ClientPool 74" daemon prio=10 tid=0x00007f9bb8001000 nid=0x74cb in
Object.wait() [0x00007f9c1c22c000]
java.lang.Thread.State: RUNNABLE
at
org.apache.accumulo.server.security.AuditedSecurityOperation.canScan(AuditedSecurityOperation.java:147)
at
org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.startScan(TabletServer.java:1161)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.accumulo.trace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:63)
at $Proxy10.startScan(Unknown Source)
at
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startScan.getResult(TabletClientService.java:2173)
at
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startScan.getResult(TabletClientService.java:2157)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at
org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:161)
at
org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:478)
at
org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:213)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at
org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
at
org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
at java.lang.Thread.run(Thread.java:662)
{noformat}
Seems to happen only under heavy loads. I'm seeing it while experimenting with
running 8+ tests simultaneously.
> intermittent integration test failure
> -------------------------------------
>
> Key: ACCUMULO-1740
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1740
> Project: Accumulo
> Issue Type: Bug
> Components: test
> Reporter: Eric Newton
> Assignee: Eric Newton
>
> Some of the recovery integration tests fail with a very long timeout (10
> minutes).
> After a restart of the tablet servers, the WAL is sorted, and the root tablet
> is assigned. After that, the master does not assign the !METADATA tablets.
> I've managed to jstack the master, and it seems to be stuck scanning. I
> turned on DEBUG log messages and I see this:
> {noformat}
> 2013-09-25 17:27:46,340 [impl.TabletServerBatchReaderIterator] DEBUG: Server
> : rd6ul-14706v.tycho.ncsc.mil:37957 msg : java.net.SocketTimeoutException:
> 120000 millis timeout while waiting for channel to be ready for
> read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362
> remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
> 2013-09-25 17:27:46,340 [impl.TabletServerBatchReaderIterator] DEBUG:
> org.apache.thrift.transport.TTransportException:
> java.net.SocketTimeoutException: 120000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
> java.io.IOException: org.apache.thrift.transport.TTransportException:
> java.net.SocketTimeoutException: 120000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
> at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:705)
> at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:364)
> at
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> at
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.thrift.transport.TTransportException:
> java.net.SocketTimeoutException: 120000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at
> org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
> at
> org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at
> org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.readAll(ThriftTransportPool.java:254)
> at
> org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
> at
> org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
> at
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:310)
> at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:290)
> at
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:650)
> ... 7 more
> Caused by: java.net.SocketTimeoutException: 120000 millis timeout while
> waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362
> remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> ... 18 more
> {noformat}
> The tablet server does put the root tablet online.
> There are 8 tests that restart tablet servers, this usually only happens to
> one of the tests per run, making it difficult to track down.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira