Kihwal Lee created HDFS-9752:
--------------------------------
Summary: Permanent write failures may happen to slow writers
during datanode rolling upgrades
Key: HDFS-9752
URL: https://issues.apache.org/jira/browse/HDFS-9752
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Kihwal Lee
Priority: Critical
When datanodes are being upgraded, an out-of-band (OOB) ack is sent upstream
and the client performs a pipeline recovery. The client may hit this multiple
times as more nodes get upgraded. This normally causes no problems, but if
the client is holding the stream open without writing any data during this
time, a permanent write failure can occur.
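For illustration, a writer pattern that can trigger this might look like the
sketch below. The filesystem URI, path, and timing are made up; the key point
is that the stream stays open and idle across multiple datanode restarts.
{code:java}
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SlowWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://nn:8020"), conf);
    // Open the stream, write a little, then go idle. While idle, the client
    // sends heartbeat packets that do not advance the sequence number.
    try (FSDataOutputStream out = fs.create(new Path("/tmp/slow-writer.log"))) {
      out.write("first record\n".getBytes(StandardCharsets.UTF_8));
      out.hflush();
      // If datanodes in the pipeline are restarted for upgrade 5+ times
      // while we sleep here, each restart forces a pipeline recovery with
      // no acked progress in between, and the write fails permanently.
      Thread.sleep(60 * 60 * 1000L); // idle for an hour
      out.write("second record\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
{code}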
This is because there is a limit of 5 recovery attempts for the same packet,
tracked by the last acked sequence number. Since the empty heartbeat packets
sent on an idle output stream do not increment the sequence number, the
write will fail after the client sees 5 pipeline breakages caused by
datanode upgrades.
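A simplified sketch of the recovery-attempt accounting in the client's
streamer thread; field and method names only approximate the real ones in
DFSOutputStream and are not copied verbatim from the source.
{code:java}
// Sketch only: approximates the client-side accounting, not exact source.
class RecoveryBudget {
  static final int MAX_RECOVERIES_FOR_SAME_PACKET = 5;

  long lastAckedSeqno = -1;               // advanced only by real data packets
  long lastAckedSeqnoBeforeFailure = -1;  // snapshot taken at each failure
  int pipelineRecoveryCount = 0;

  /** Returns false when the write must be failed permanently. */
  boolean onPipelineFailure() {
    if (lastAckedSeqno != lastAckedSeqnoBeforeFailure) {
      // Progress was made since the previous failure: reset the budget.
      lastAckedSeqnoBeforeFailure = lastAckedSeqno;
      pipelineRecoveryCount = 1;
      return true;
    }
    // No packet acked between failures. Heartbeat packets on an idle
    // stream never advance lastAckedSeqno, so rolling-upgrade restarts
    // alone can exhaust this budget and fail the write.
    return ++pipelineRecoveryCount <= MAX_RECOVERIES_FOR_SAME_PACKET;
  }
}
{code}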
This check/limit was added to avoid spinning until running out of nodes in
the cluster due to corruption or other irrecoverable conditions. Datanode
upgrade restarts should be excluded from the count.
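One possible shape for such an exclusion, assuming the streamer can tell a
restart-triggered recovery apart from a genuine failure (the OOB restart
status carried in the pipeline ack signals a datanode shutting down for
upgrade); the restartTriggered flag below is hypothetical:
{code:java}
  /**
   * Variant of the accounting above that does not charge upgrade restarts
   * against the budget. restartTriggered is a hypothetical flag that would
   * be set where the OOB restart ack from the datanode is handled.
   */
  boolean onPipelineFailure(boolean restartTriggered) {
    if (restartTriggered) {
      // Recover the pipeline, but an upgrade restart is expected and
      // recoverable, so it should not consume a recovery attempt.
      return true;
    }
    if (lastAckedSeqno != lastAckedSeqnoBeforeFailure) {
      lastAckedSeqnoBeforeFailure = lastAckedSeqno;
      pipelineRecoveryCount = 1;
      return true;
    }
    return ++pipelineRecoveryCount <= MAX_RECOVERIES_FOR_SAME_PACKET;
  }
{code}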