[ 
https://issues.apache.org/jira/browse/HDFS-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122878#comment-16122878
 ] 

Vinayakumar B commented on HDFS-11738:
--------------------------------------

I see that, Following changes in 
{{DFSInputStream#hedgedFetchBlockByteRange(..)}} present in both HDFS-11303 and 
this jira, 
{code}
         futures.add(firstRequest);
+        Future<ByteBuffer> future = null;
         try {
-          Future<ByteBuffer> future = hedgedService.poll(
+          future = hedgedService.poll(
               conf.getHedgedReadThresholdMillis(), TimeUnit.MILLISECONDS);
           if (future != null) {
             ByteBuffer result = future.get();
@@ -1142,16 +1143,18 @@ private void hedgedFetchBlockByteRange(LocatedBlock 
block, long start,
           }
           DFSClient.LOG.debug("Waited {}ms to read from {}; spawning hedged "
               + "read", conf.getHedgedReadThresholdMillis(), chosenNode.info);
-          // Ignore this node on next go around.
-          ignored.add(chosenNode.info);
           dfsClient.getHedgedReadMetrics().incHedgedReadOps();
           // continue; no need to refresh block locations
         } catch (ExecutionException e) {
-          // Ignore
+          futures.remove(future);
         } catch (InterruptedException e) {
           throw new InterruptedIOException(
               "Interrupted while waiting for reading task");
         }
+        // Ignore this node on next go around.
+        // If poll timeout and the request still ongoing, don't consider it
+        // again. If read data failed, don't consider it either.
+        ignored.add(chosenNode.info);
       } else {
         // We are starting up a 'hedged' read. We have a read already
         // ongoing. Call getBestNodeDNAddrPair instead of chooseDataNode.
{code}
I think its fair to commit HDFS-11303 first and give credit for [~jzhuge]'s 
efforts, as it has associated test written along with the change.

I can update the patch again once HDFS-11303 committed. Anyway test in my patch 
will fail even after HDFS-11303 is committed. i.e. HDFS-11303 is not exactly 
the fix for this issue.
So lets get HDFS-11303 committed first.

> Hedged pread takes more time when block moved from initial locations
> --------------------------------------------------------------------
>
>                 Key: HDFS-11738
>                 URL: https://issues.apache.org/jira/browse/HDFS-11738
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Vinayakumar B
>            Assignee: Vinayakumar B
>         Attachments: HDFS-11738-01.patch, HDFS-11738-02.patch
>
>
> Scenario : 
> Same as HDFS-11708.
> During Hedge read, 
> 1. First two locations fails to read the data in hedged mode.
> 2. chooseData refetches locations and adds a future to read from DN3.
> 3. after adding future to DN3, main thread goes for refetching locations in 
> loop and stucks there till all 3  retries to fetch locations exhausted, which 
> consumes ~20 seconds with exponential retry time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to