[ https://issues.apache.org/jira/browse/HDFS-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Walter Su updated HDFS-9752:
----------------------------
    Assignee: Walter Su
      Status: Patch Available  (was: Open)

> Permanent write failures may happen to slow writers during datanode rolling upgrades
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-9752
>                 URL: https://issues.apache.org/jira/browse/HDFS-9752
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Walter Su
>            Priority: Critical
>         Attachments: HDFS-9752.01.patch
>
>
> When datanodes are being upgraded, an out-of-band ack is sent upstream and
> the client performs a pipeline recovery. The client may hit this multiple
> times as more nodes get upgraded. This normally does not cause any issue,
> but if the client holds the stream open without writing any data during
> this time, a permanent write failure can occur.
> This is because there is a limit of 5 recovery attempts for the same
> packet, tracked by the "last acked sequence number". Since the empty
> heartbeat packets for an idle output stream do not increment the sequence
> number, the write will fail after seeing 5 pipeline breakages caused by
> datanode upgrades.
> This check/limit was added to avoid spinning until running out of nodes in
> the cluster due to corruption or other irrecoverable conditions. The
> datanode upgrade-restart should be excluded from the count.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
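The failure mode described above can be sketched in isolation. The following is an illustrative Java sketch, not the actual DataStreamer code: the class name `RecoveryTracker`, the method `onPipelineFailure`, and the constant `MAX_RECOVERIES_PER_PACKET` are hypothetical names standing in for the client's per-packet recovery accounting. It shows how a retry counter keyed to the last acked sequence number resets when a writer makes progress, but exhausts its limit for an idle writer whose heartbeat packets never advance that sequence number.

```java
// Illustrative sketch of the per-packet pipeline-recovery limit (names are
// hypothetical, not the real HDFS client identifiers).
public class RecoveryTracker {
    // The issue describes a limit of 5 recovery attempts for the same packet.
    static final int MAX_RECOVERIES_PER_PACKET = 5;

    // Sentinel meaning "no failure observed yet".
    private long lastAckedSeqnoBeforeFailure = -2;
    private int recoveryCount = 0;

    /**
     * Called on each pipeline failure.
     *
     * @param lastAckedSeqno highest data-packet seqno acked so far; heartbeat
     *        packets on an idle stream never advance it
     * @return true if another recovery attempt is allowed
     */
    boolean onPipelineFailure(long lastAckedSeqno) {
        if (lastAckedSeqno != lastAckedSeqnoBeforeFailure) {
            // Progress was made since the previous failure: reset the count.
            lastAckedSeqnoBeforeFailure = lastAckedSeqno;
            recoveryCount = 1;
            return true;
        }
        // No progress: an idle stream stays stuck at the same acked seqno,
        // so repeated upgrade-driven restarts eventually exhaust the limit.
        return ++recoveryCount <= MAX_RECOVERIES_PER_PACKET;
    }

    public static void main(String[] args) {
        RecoveryTracker t = new RecoveryTracker();
        // Idle writer: the acked seqno never advances across 6 failures
        // (e.g. 6 datanode rolling-upgrade restarts).
        boolean allowed = true;
        for (int i = 0; i < 6; i++) {
            allowed = t.onPipelineFailure(0);
        }
        System.out.println(allowed); // prints false: the write fails for good
    }
}
```

An actively writing client is unaffected because each acked data packet moves the sequence number forward and resets the counter; only the idle writer trips the limit, which is why the issue proposes excluding upgrade-restarts from the count.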