[ 
https://issues.apache.org/jira/browse/ACCUMULO-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604777#comment-14604777
 ] 

Steve Loughran commented on ACCUMULO-3914:
------------------------------------------

If HDFS threw this in the {{RetryInvocationHandler}}, then it made that 
decision based on the outcome of the communication attempts and the retry 
policy. Clearly it decided the exception wasn't going to be retried. Yet, as 
this is one of the exceptions which the {{FailoverOnNetworkExceptionRetry}} 
policy will retry on, either a different policy was in use or the client gave 
up trying.

Rather than wrapping another catch/retry layer on top, one that (in the patch) 
makes no attempt to distinguish exceptions worth retrying from those that 
aren't (security exceptions, interrupted IO, ...), I'd recommend making sure 
that {{"dfs.client.retry.policy.enabled"}} is set to true and looking at some 
of the other tunable parameters. The HDFS client should be able to adopt a 
retry policy that suits; if it can't, that's somewhere there may be scope for 
improvement. 

Summary: select the appropriate DFS client retry policy; ideally your tests 
should use the same ones you recommend for production.
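
As a sketch, those knobs live in the client-side {{hdfs-site.xml}}. The 
property names below are the Hadoop 2.x client retry settings; the 
{{dfs.client.retry.policy.spec}} value (pairs of sleep-time-in-ms, 
number-of-retries) is shown with its defaults for illustration, not as a 
recommendation:

{noformat}
<!-- client-side hdfs-site.xml -->
<property>
  <name>dfs.client.retry.policy.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- illustrative: "10000,6,60000,10" = retry 6 times at 10s intervals,
       then 10 times at 60s intervals -->
  <name>dfs.client.retry.policy.spec</name>
  <value>10000,6,60000,10</value>
</property>
{noformat}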

> Restarting HDFS caused scan to fail 
> ------------------------------------
>
>                 Key: ACCUMULO-3914
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3914
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.8.0
>
>         Attachments: ACCUMULO-3914-001.patch
>
>
> I was running random walk to test 1.6.3 RC1.   I had an incorrect hdfs 
> config.  I changed the hdfs config and restarted hdfs while the test was 
> running.   I would not have expected this to cause problems, but it caused 
> scans to fail.
> Below are client logs from RW.
> {noformat}
> 23 14:37:36,547 [randomwalk.Framework] ERROR: Error during random walk
> java.lang.Exception: Error running node Conditional.xml
>         at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:344)
>         at 
> org.apache.accumulo.test.randomwalk.Framework.run(Framework.java:63)
>         at 
> org.apache.accumulo.test.randomwalk.Framework.main(Framework.java:122)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.accumulo.start.Main$1.run(Main.java:141)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.Exception: Error running node ct.Transfer
>         at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:344)
>         at org.apache.accumulo.test.randomwalk.Module$1.call(Module.java:281)
>         at org.apache.accumulo.test.randomwalk.Module$1.call(Module.java:276)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at 
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         ... 1 more
> Caused by: java.lang.RuntimeException: 
> org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server 
> worker0:9997
>         at 
> org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:187)
>         at 
> org.apache.accumulo.core.client.IsolatedScanner$RowBufferingIterator.readRow(IsolatedScanner.java:69)
>         at 
> org.apache.accumulo.core.client.IsolatedScanner$RowBufferingIterator.<init>(IsolatedScanner.java:148)
>         at 
> org.apache.accumulo.core.client.IsolatedScanner.iterator(IsolatedScanner.java:236)
>         at 
> org.apache.accumulo.test.randomwalk.conditional.Transfer.visit(Transfer.java:91)
>         ... 10 more
> Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: 
> Error on server worker0:9997
>         at 
> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:287)
>         at 
> org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:84)
>         at 
> org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:177)
>         ... 14 more
> Caused by: org.apache.thrift.TApplicationException: Internal error processing 
> startScan
>         at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
>         at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
>         at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startScan(TabletClientService.java:228)
>         at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startScan(TabletClientService.java:204)
>         at 
> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:403)
>         at 
> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:279)
>         ... 16 more
> {noformat}
> Below are logs from the tserver.
> {noformat}
> 2015-06-23 14:37:36,553 [thrift.ProcessFunction] ERROR: Internal error 
> processing startScan
> org.apache.thrift.TException: java.util.concurrent.ExecutionException: 
> java.net.ConnectException: Call From worker0/10.1.5.184 to leader2:10000 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more detai
> ls see:  http://wiki.apache.org/hadoop/ConnectionRefused
>         at 
> org.apache.accumulo.server.util.RpcWrapper$1.invoke(RpcWrapper.java:51)
>         at com.sun.proxy.$Proxy17.startScan(Unknown Source)
>         at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startScan.getResult(TabletClientService.java:2179)
>         at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startScan.getResult(TabletClientService.java:2163)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>         at 
> org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:168)
>         at 
> org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516)
>         at 
> org.apache.accumulo.server.util.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:77)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at 
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:745)
> 2015-06-23 14:37:36,556 [tserver.TabletServer] WARN : exception while 
> scanning tablet b;b174<
> java.net.ConnectException: Call From worker0/10.1.5.184 to leader2:10000 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:254)
>         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1220)
>         at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1210)
>         at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1200)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:271)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:238)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:231)
>         at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1498)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:302)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:298)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>         at 
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:263)
>         at 
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$100(CachableBlockFile.java:144)
>         at 
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$RawBlockLoader.get(CachableBlockFile.java:195)
>         at 
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:320)
>         at 
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getDataBlock(CachableBlockFile.java:400)
>         at 
> org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader.getDataBlock(RFile.java:590)
>         at 
> org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader._seek(RFile.java:715)
>         at 
> org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader.seek(RFile.java:607)
>         at 
> org.apache.accumulo.core.iterators.system.LocalityGroupIterator.seek(LocalityGroupIterator.java:138)
>         at 
> org.apache.accumulo.core.file.rfile.RFile$Reader.seek(RFile.java:980)
>         at 
> org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.readNext(SourceSwitchingIterator.java:135)
>         at 
> org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:182)
>         at 
> org.apache.accumulo.server.problems.ProblemReportingIterator.seek(ProblemReportingIterator.java:94)
>         at 
> org.apache.accumulo.core.iterators.system.MultiIterator.seek(MultiIterator.java:105)
>         at 
> org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at 
> org.apache.accumulo.core.iterators.system.StatsIterator.seek(StatsIterator.java:64)
>         at 
> org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at 
> org.apache.accumulo.core.iterators.system.DeletingIterator.seek(DeletingIterator.java:67)
>         at 
> org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at 
> org.apache.accumulo.core.iterators.SkippingIterator.seek(SkippingIterator.java:42)
>         at 
> org.apache.accumulo.core.iterators.system.ColumnFamilySkippingIterator.seek(ColumnFamilySkippingIterator.java:123)
>         at 
> org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at org.apache.accumulo.core.iterators.Filter.seek(Filter.java:64)
>         at 
> org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at org.apache.accumulo.core.iterators.Filter.seek(Filter.java:64)
>         at 
> org.apache.accumulo.core.iterators.system.SynchronizedIterator.seek(SynchronizedIterator.java:56)
>         at 
> org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:101)
>         at 
> org.apache.accumulo.core.iterators.user.VersioningIterator.seek(VersioningIterator.java:81)
>         at 
> org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.readNext(SourceSwitchingIterator.java:135)
>         at 
> org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:182)
>         at org.apache.accumulo.tserver.Tablet.nextBatch(Tablet.java:1664)
>         at org.apache.accumulo.tserver.Tablet.access$3200(Tablet.java:174)
>         at org.apache.accumulo.tserver.Tablet$Scanner.read(Tablet.java:1804)
>         at 
> org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$NextBatchTask.run(TabletServer.java:1081)
>         at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at 
> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>         ... 62 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
