[ https://issues.apache.org/jira/browse/HDFS-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138306#comment-15138306 ]

Walter Su commented on HDFS-9752:
---------------------------------

Thanks, all, for reviewing the patch.
The patch depends on HDFS-9347, which I have cherry-picked to 2.6.5. I've now
uploaded separate patches for 2.7 and 2.6.

> Permanent write failures may happen to slow writers during datanode rolling 
> upgrades
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-9752
>                 URL: https://issues.apache.org/jira/browse/HDFS-9752
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Walter Su
>            Priority: Critical
>         Attachments: HDFS-9752-branch-2.6.03.patch, 
> HDFS-9752-branch-2.7.03.patch, HDFS-9752.01.patch, HDFS-9752.02.patch, 
> HDFS-9752.03.patch, HdfsWriter.java
>
>
> When datanodes are being upgraded, an out-of-band ack is sent upstream and 
> the client does a pipeline recovery. The client may hit this multiple times 
> as more nodes get upgraded.  This normally does not cause any issue, but if 
> the client is holding the stream open without writing any data during this 
> time, a permanent write failure can occur.
> This is because there is a limit of 5 recovery attempts for the same packet, 
> tracked by the "last acked sequence number". Since the empty heartbeat 
> packets sent on an idle output stream do not increment the sequence number, 
> the write will fail after seeing 5 pipeline breakages caused by datanode 
> upgrades.
> This check/limit was added to avoid spinning until running out of nodes in 
> the cluster due to corruption or other irrecoverable conditions. The 
> datanode upgrade-restart should be excluded from the count.
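
For context, the failure mode above comes down to a retry counter that is only
reset when the last acked sequence number advances. A minimal sketch of that
bookkeeping follows; the names (PipelineRecoveryTracker, MAX_RECOVERY_ATTEMPTS,
onPipelineFailure) are illustrative, not the exact identifiers used in the
client's DFSOutputStream/DataStreamer code:

    // Sketch of the recovery-attempt bookkeeping described above.
    // Identifiers are hypothetical; this is not the actual HDFS client code.
    class PipelineRecoveryTracker {
        private static final int MAX_RECOVERY_ATTEMPTS = 5;

        private long lastAckedSeqnoBeforeFailure = -1;
        private int recoveryCount = 0;

        /**
         * Called on each pipeline failure. Returns false when the write
         * should be failed permanently.
         */
        boolean onPipelineFailure(long lastAckedSeqno) {
            if (lastAckedSeqno != lastAckedSeqnoBeforeFailure) {
                // Progress was made since the last failure; reset the budget.
                lastAckedSeqnoBeforeFailure = lastAckedSeqno;
                recoveryCount = 1;
                return true;
            }
            // No progress: an idle stream sends only heartbeat packets, which
            // do not advance the acked sequence number, so a series of
            // rolling-upgrade restarts alone can exhaust this budget.
            return ++recoveryCount <= MAX_RECOVERY_ATTEMPTS;
        }
    }

Per the last paragraph of the description, the direction of the fix is to not
count failures caused by a datanode restarting for an upgrade against this
budget.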



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
