[
https://issues.apache.org/jira/browse/HDFS-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901442#comment-14901442
]
Jing Zhao commented on HDFS-9106:
---------------------------------
bq. Transfer timeout needs to be different from per-packet timeout.
+1 for changing the timeout.
bq. if the partial block transfer fails, the write will fail permanently
without retrying or continuing with whatever is in the pipeline
If the partial block transfer fails, and if {{bestEffort}} is enabled, will the
current code still use the remaining datanodes to set up the pipeline? It looks
like {{nodes}} may still include the new DN after the failure, though.
> Transfer failure during pipeline recovery causes permanent write failures
> -------------------------------------------------------------------------
>
> Key: HDFS-9106
> URL: https://issues.apache.org/jira/browse/HDFS-9106
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Critical
> Attachments: HDFS-9106-poc.patch
>
>
> When a new node is added to a write pipeline during flush/sync, if the
> partial block transfer fails, the write will fail permanently without
> retrying or continuing with whatever is in the pipeline.
> The transfer often fails in busy clusters due to timeout. There is no
> per-packet ACK between client and datanode or between source and target
> datanodes. If the total transfer time exceeds the configured timeout + 10
> seconds (2 * 5 seconds slack), it is considered failed. Naturally, the
> failure rate is higher with bigger block sizes.
> I propose the following changes:
> - The transfer timeout needs to be different from the per-packet timeout.
> - The transfer should be retried if it fails.
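
The timeout arithmetic described above ("configured timeout + 10 seconds, i.e. 2 * 5 seconds slack") can be sketched as follows. This is an illustrative sketch only; the class and method names below are hypothetical, not the actual HDFS identifiers. The idea is that the deadline for the whole partial-block transfer is the base socket write timeout plus a fixed 5-second extension per node involved (source and target, hence 2), so a large block transferred over a slow link can easily blow past it even when every individual packet is healthy:

```java
// Hypothetical sketch of the transfer deadline computation (illustrative
// names, not the real HDFS identifiers).
public class TransferTimeoutSketch {
    // 5-second per-node slack added on top of the configured timeout.
    static final long TIMEOUT_EXTENSION_MS = 5_000;

    // Total time allowed for the whole partial-block transfer:
    // base socket write timeout + 5 s per participating node.
    static long transferTimeoutMs(long socketWriteTimeoutMs, int numNodes) {
        return socketWriteTimeoutMs + TIMEOUT_EXTENSION_MS * numNodes;
    }

    public static void main(String[] args) {
        // With a 60 s configured timeout and 2 nodes (source + target),
        // the transfer is considered failed after 70 s total -- regardless
        // of block size, which is why bigger blocks fail more often.
        System.out.println(transferTimeoutMs(60_000, 2)); // 70000
    }
}
```

Because there is no per-packet ACK during the transfer, this single end-to-end deadline is the only failure detector, which is what motivates decoupling it from the per-packet timeout.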
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)