[ https://issues.apache.org/jira/browse/HDFS-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122878#comment-16122878 ]
Vinayakumar B commented on HDFS-11738:
--------------------------------------

I see that the following changes in {{DFSInputStream#hedgedFetchBlockByteRange(..)}} are present in both HDFS-11303 and this jira:

{code}
       futures.add(firstRequest);
+      Future<ByteBuffer> future = null;
       try {
-        Future<ByteBuffer> future = hedgedService.poll(
+        future = hedgedService.poll(
             conf.getHedgedReadThresholdMillis(), TimeUnit.MILLISECONDS);
         if (future != null) {
           ByteBuffer result = future.get();
@@ -1142,16 +1143,18 @@ private void hedgedFetchBlockByteRange(LocatedBlock block, long start,
         }
         DFSClient.LOG.debug("Waited {}ms to read from {}; spawning hedged "
             + "read", conf.getHedgedReadThresholdMillis(), chosenNode.info);
-        // Ignore this node on next go around.
-        ignored.add(chosenNode.info);
         dfsClient.getHedgedReadMetrics().incHedgedReadOps();
         // continue; no need to refresh block locations
       } catch (ExecutionException e) {
-        // Ignore
+        futures.remove(future);
       } catch (InterruptedException e) {
         throw new InterruptedIOException(
             "Interrupted while waiting for reading task");
       }
+      // Ignore this node on next go around.
+      // If poll timeout and the request still ongoing, don't consider it
+      // again. If read data failed, don't consider it either.
+      ignored.add(chosenNode.info);
     } else {
       // We are starting up a 'hedged' read. We have a read already
       // ongoing. Call getBestNodeDNAddrPair instead of chooseDataNode.
{code}

I think it's fair to commit HDFS-11303 first and give credit to [~jzhuge] for his efforts, as it has an associated test written along with the change. I can update my patch again once HDFS-11303 is committed.

Note that the test in my patch will still fail even after HDFS-11303 is committed, i.e. HDFS-11303 is not exactly the fix for this issue. So let's get HDFS-11303 committed first.
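For readers without the HDFS source at hand, the pattern in the diff above can be sketched in isolation. This is only an illustrative sketch, not the actual {{DFSInputStream}} code: the class {{HedgedReadSketch}}, the {{Replica}} type, and the {{hedgedRead}} method are hypothetical names, and a plain {{ExecutorCompletionService}} stands in for the client's hedged read service. It shows the two points of the change: a failed future is removed from the outstanding list, and the chosen node is added to the ignored set both when the poll times out and when the read fails.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a hedged read loop; not the HDFS implementation.
class HedgedReadSketch {

  /** Hypothetical stand-in for a datanode: a name plus the task that reads from it. */
  static final class Replica {
    final String name;
    final Callable<String> read;

    Replica(String name, Callable<String> read) {
      this.name = name;
      this.read = read;
    }
  }

  /**
   * Starts a read on each replica in turn. If no read finishes within
   * thresholdMillis, the next replica is tried in parallel (the hedged read).
   * Every replica that timed out or failed is recorded in 'ignored'.
   */
  static String hedgedRead(List<Replica> replicas, long thresholdMillis,
      Set<String> ignored) throws Exception {
    ExecutorService pool = Executors.newCachedThreadPool();
    CompletionService<String> hedgedService = new ExecutorCompletionService<>(pool);
    List<Future<String>> futures = new ArrayList<>();
    try {
      for (Replica chosenNode : replicas) {
        futures.add(hedgedService.submit(chosenNode.read));
        Future<String> future = null;
        try {
          future = hedgedService.poll(thresholdMillis, TimeUnit.MILLISECONDS);
          if (future != null) {
            return future.get(); // some read finished in time
          }
          // Poll timed out: fall through and start a hedged read.
        } catch (ExecutionException e) {
          // The completed read failed; it is no longer outstanding.
          futures.remove(future);
        }
        // Timed out or failed: do not pick this node again.
        ignored.add(chosenNode.name);
      }
      // Every replica has been started; wait for whatever is still running.
      while (!futures.isEmpty()) {
        Future<String> future = hedgedService.take();
        futures.remove(future);
        try {
          return future.get();
        } catch (ExecutionException e) {
          // This read failed too; keep waiting for the others.
        }
      }
      throw new IOException("all replicas failed");
    } finally {
      for (Future<String> f : futures) {
        f.cancel(true); // abandon reads that lost the race
      }
      pool.shutdownNow();
    }
  }
}
```

With the old "// Ignore" handler, a failed future would stay in the outstanding list and the failed node would never reach the ignored set, so the same node could be chosen again on the next pass.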
> Hedged pread takes more time when block moved from initial locations
> --------------------------------------------------------------------
>
>                 Key: HDFS-11738
>                 URL: https://issues.apache.org/jira/browse/HDFS-11738
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Vinayakumar B
>            Assignee: Vinayakumar B
>         Attachments: HDFS-11738-01.patch, HDFS-11738-02.patch
>
>
> Scenario:
> Same as HDFS-11708.
> During a hedged read,
> 1. The first two locations fail to read the data in hedged mode.
> 2. chooseData refetches locations and adds a future to read from DN3.
> 3. After adding the future for DN3, the main thread goes back to refetching
> locations in a loop and stays stuck there until all 3 retries to fetch
> locations are exhausted, which consumes ~20 seconds with exponential retry time.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)