[ https://issues.apache.org/jira/browse/HDFS-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691480#comment-15691480 ]

Uma Maheswara Rao G commented on HDFS-11164:
--------------------------------------------

[~rakeshr], Thanks for reporting the issue.
Adding an extra error code makes sense to me.

But I think the current patch may not fully avoid the unnecessary retries.
Please check the following cases, and correct me if I am wrong.

{code}
// Check that the block movement failure(s) are only due to block pinning.
// If yes, just mark as failed and exit without retries.
if (!hasFailed && hasBlockPinningFailure) {
  hasFailed = hasBlockPinningFailure;
  result.setRetryFailed();
} else if (hasFailed && !hasSuccess) {
  if (retryCount.get() == retryMaxAttempts) {
    result.setRetryFailed();
    LOG.error("Failed to move some block's after "
        + retryMaxAttempts + " retries.");
    return result;
  } else {
    retryCount.incrementAndGet();
  }
} else {
  // Reset retry count if no failure.
  retryCount.set(0);
}
result.updateHasRemaining(hasFailed);
return result;
{code}
Here the {{!hasFailed && hasBlockPinningFailure}} case targets only pinned failures with no normal failures, right? If so, when there are normal failures and pinned failures together, it will still retry?
If it retries, it may scan those paths again and try to move blocks even though they are pinned.
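For illustration, here is a minimal standalone sketch (assumed flag values, not code from the patch) of why an iteration with mixed failures never reaches the block-pinning branch:
{code}
// Minimal sketch with assumed flag values: an iteration that hit both a
// normal (non-pinning) failure and a block-pinning failure.
public class RetryBranchSketch {
  public static void main(String[] args) {
    boolean hasFailed = true;               // a normal move failed
    boolean hasBlockPinningFailure = true;  // a pinned-block move also failed
    boolean hasSuccess = false;             // nothing moved successfully

    if (!hasFailed && hasBlockPinningFailure) {
      // Never reached here, because hasFailed is true.
      System.out.println("pinning-only branch: no retry");
    } else if (hasFailed && !hasSuccess) {
      // Reached: retryCount would be bumped and the same paths, including
      // the pinned blocks, would be scanned and scheduled again.
      System.out.println("retry branch");
    }
  }
}
{code}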
We may need to think about this a bit differently from node-level failures.
One thought is, blocks that failed due to pinning could be stored separately, and when a retry happens, if a block exists in that {{failedDueToPinned}} list, skip adding it to the {{PendingMoves}}. Just a thought; we need to check the feasibility.
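A rough sketch of that idea (the {{failedDueToPinned}} set and the {{shouldSchedule}} hook are hypothetical names just to show the shape, not existing Mover/Dispatcher APIs):
{code}
// Rough sketch only; names are illustrative, not the actual Mover/Dispatcher code.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class PinnedBlockSkipSketch {
  // Blocks whose moves failed because the block is pinned on the source node.
  private final Set<Long> failedDueToPinned = ConcurrentHashMap.newKeySet();

  // Record a failure that came back with the (proposed) block-pinning error code.
  void recordPinningFailure(long blockId) {
    failedDueToPinned.add(blockId);
  }

  // Consulted while building PendingMoves on a retry iteration.
  boolean shouldSchedule(long blockId) {
    // Skip blocks already known to be pinned; they cannot move anyway.
    return !failedDueToPinned.contains(blockId);
  }
}
{code}
With something like this, a retry would still cover ordinary transient failures but would stop re-dispatching moves that can only fail because the block is pinned.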


> Mover should avoid unnecessary retries if the block is pinned
> -------------------------------------------------------------
>
>                 Key: HDFS-11164
>                 URL: https://issues.apache.org/jira/browse/HDFS-11164
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>         Attachments: HDFS-11164-00.patch, HDFS-11164-01.patch
>
>
> When the mover is trying to move a pinned block to another datanode, it will 
> internally hit the following IOException and mark the block movement as 
> {{failure}}. Since the Mover has the {{dfs.mover.retry.max.attempts}} config, it 
> will keep retrying this block until it reaches {{retryMaxAttempts}}. If the 
> block movement failure(s) are only due to block pinning, then retrying is 
> unnecessary. The idea of this jira is to avoid retry attempts for pinned 
> blocks, as they won't be able to move to a different node. 
> {code}
> 2016-11-22 10:56:10,537 WARN 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher: Failed to move 
> blk_1073741825_1001 with size=52 from 127.0.0.1:19501:DISK to 
> 127.0.0.1:19758:ARCHIVE through 127.0.0.1:19501
> java.io.IOException: Got error, status=ERROR, status message opReplaceBlock 
> BP-1772076264-10.252.146.200-1479792322960:blk_1073741825_1001 received 
> exception java.io.IOException: Got error, status=ERROR, status message Not 
> able to copy block 1073741825 to /127.0.0.1:19826 because it's pinned , copy 
> block BP-1772076264-10.252.146.200-1479792322960:blk_1073741825_1001 from 
> /127.0.0.1:19501, reportedBlock move is failed
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:118)
>       at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.receiveResponse(Dispatcher.java:417)
>       at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.dispatch(Dispatcher.java:358)
>       at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$5(Dispatcher.java:322)
>       at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$1.run(Dispatcher.java:1075)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}



