[
https://issues.apache.org/jira/browse/HDFS-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134388#comment-15134388
]
Arpit Agarwal commented on HDFS-9752:
-------------------------------------
bq. Thanks for the advice. Uploaded the 02 patch. The test now takes ~30s, but
it's still difficult to remove the sleep used to wait for DN shutdown.
Walter, what do you think of removing the loop in the test altogether and
trying out this suggestion?
_The test can just verify that pipelineRecoveryCount is not incremented after
DN restart and pipeline recovery. If it is not incremented after one iteration,
the ++pipelineRecoveryCount > 5 check will never be triggered._
> Permanent write failures may happen to slow writers during datanode rolling
> upgrades
> ------------------------------------------------------------------------------------
>
> Key: HDFS-9752
> URL: https://issues.apache.org/jira/browse/HDFS-9752
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Walter Su
> Priority: Critical
> Attachments: HDFS-9752.01.patch, HDFS-9752.02.patch, HdfsWriter.java
>
>
> When datanodes are being upgraded, an out-of-band ack is sent upstream and
> the client does a pipeline recovery. The client may hit this multiple times
> as more nodes get upgraded. This normally does not cause any issue, but if
> the client is holding the stream open without writing any data during this
> time, a permanent write failure can occur.
> This is because there is a limit of 5 recovery trials for the same packet,
> which is tracked by "last acked sequence number". Since the empty heartbeat
> packets for an idle output stream do not increment the sequence number, the
> write will fail after seeing 5 pipeline breakages caused by datanode upgrades.
> This check/limit was added to avoid spinning until running out of nodes in
> the cluster due to corruption or any other unrecoverable condition. The
> datanode upgrade-restart should be excluded from the count.
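
For illustration only, the counting behavior described above can be modeled in a few lines of Java. This is a simplified sketch, not the actual HDFS client code: the class PipelineRecoveryModel, its methods, and the excludeRestarts flag are hypothetical names chosen for this example; only the pipelineRecoveryCount counter, the "last acked sequence number" idea, and the limit of 5 come from the discussion. It assumes a recovery for a newly acked packet resets the counter to 1 and that a recovery for the same packet increments it.

```java
/** Simplified, hypothetical model of the per-packet recovery limit (not Hadoop code). */
class PipelineRecoveryModel {
    static final int MAX_RECOVERIES_PER_PACKET = 5;

    // When true, recoveries caused by a datanode restart are not counted.
    private final boolean excludeRestarts;
    private long lastAckedSeqno = -1;
    private long lastAckedSeqnoBeforeFailure = -1;
    private int pipelineRecoveryCount = 0;

    PipelineRecoveryModel(boolean excludeRestarts) {
        this.excludeRestarts = excludeRestarts;
    }

    /** A data packet was acked. Heartbeats on an idle stream never call this. */
    void ack(long seqno) {
        lastAckedSeqno = seqno;
    }

    /** Runs one pipeline recovery; returns false if the write fails permanently. */
    boolean recover(boolean causedByDatanodeRestart) {
        if (excludeRestarts && causedByDatanodeRestart) {
            return true; // upgrade-restart: recover without touching the counter
        }
        if (lastAckedSeqno != lastAckedSeqnoBeforeFailure) {
            // Progress was made since the last failure: start counting afresh.
            lastAckedSeqnoBeforeFailure = lastAckedSeqno;
            pipelineRecoveryCount = 1;
            return true;
        }
        // Same packet as last time: enforce the limit.
        return ++pipelineRecoveryCount <= MAX_RECOVERIES_PER_PACKET;
    }

    int getPipelineRecoveryCount() {
        return pipelineRecoveryCount;
    }

    public static void main(String[] args) {
        // Idle writer, restarts counted: the stream eventually fails.
        PipelineRecoveryModel idle = new PipelineRecoveryModel(false);
        idle.ack(0);
        for (int i = 1; i <= 6; i++) {
            System.out.println("restart " + i + " survived: " + idle.recover(true));
        }
        // Restarts excluded: one iteration is enough to check the counter.
        PipelineRecoveryModel fixed = new PipelineRecoveryModel(true);
        fixed.ack(0);
        fixed.recover(true);
        System.out.println("count after one restart: " + fixed.getPipelineRecoveryCount());
    }
}
```

In this model, a single restart followed by a check that the counter did not move is sufficient, which mirrors the suggestion to drop the loop from the test: if one restart does not increment the counter, no number of restarts can reach the limit.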