[
https://issues.apache.org/jira/browse/HADOOP-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708437#action_12708437
]
dhruba borthakur commented on HADOOP-5713:
------------------------------------------
> when createOutputStream fails, a dfs client should take the failed datanode
> out of the pipeline, bump the block's generation stamp ...
@Hairong: This was purposely *not done* when we did the
client-streaming-data-to-datanodes work. The reason is that doing so reduces
the robustness of the block. You will recall that when a replica in the
pipeline fails, the client continues writing to the other replicas, and the NN
makes no attempt to increase that block's replication factor until the file is
closed. This means that when we remove a datanode from a pipeline, we expose
that block to a higher probability of going "missing or corrupt". This
situation is unavoidable when the client has already written partial data to a
block and then encounters an error in the pipeline; in that case we drop the
bad datanode and continue with the remaining datanode(s).
On the other hand, when createOutputStream fails, we have the luxury of
discarding all the datanodes in the current pipeline, because the client has
not yet written any data to any of them. We could have dropped only the bad
datanode (as you suggested), but that would leave the block exposed to a higher
probability of going "missing/corrupt" if the other two replicas also fail
sometime before the file is closed. We can avoid that degradation by fetching
an entirely new pipeline from the NN.
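To make that concrete, here is a rough sketch of the recovery path for the case
where nothing has been written yet: abandon the block and ask the NN for an
entirely new pipeline. The names (allocateBlock, abandonBlock,
PipelineConnector, etc.) are illustrative stand-ins, not the actual
DFSClient/ClientProtocol methods:
{noformat}
import java.io.IOException;
import java.util.List;

/**
 * Illustrative sketch only -- hypothetical names, not the 0.18 DFSClient code.
 * When pipeline setup fails before any byte has been written, the client can
 * abandon the block and ask the NN for an entirely new pipeline, instead of
 * merely dropping the one bad datanode.
 */
class PipelineSetupSketch {

  /** Minimal stand-ins for the NN calls the client would make. */
  interface NameNodeCalls {
    List<String> allocateBlock(String src, String client) throws IOException; // new block + pipeline
    void abandonBlock(String src, String client) throws IOException;          // give the block back
  }

  /** Hypothetical connect step; returns false if any datanode refuses the pipeline. */
  interface PipelineConnector {
    boolean connect(List<String> datanodes);
  }

  static List<String> setupPipeline(NameNodeCalls nn, PipelineConnector conn,
                                    String src, String client, int maxRetries)
      throws IOException {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      List<String> pipeline = nn.allocateBlock(src, client);  // fresh set of datanodes
      if (conn.connect(pipeline)) {
        return pipeline;  // pipeline established; streaming can begin
      }
      // Nothing has been written, so throw away the whole pipeline rather than
      // just the bad datanode; the next allocation may still include the down
      // node until the NN declares it dead.
      nn.abandonBlock(src, client);
    }
    throw new IOException("Unable to create new block.");
  }
}
{noformat}
The mid-write case above does not have this option: once partial data exists on
some replicas, the client can only drop the bad datanode and continue with
whatever remains.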
@Alban: Simply increasing the number of write retries won't help in that case
unless the retries outlast the NN's dead-node timeout. I understand your
use-case now. The NN takes 10 minutes of missed heartbeats from a datanode
before declaring it dead. Is it possible for you to set
dfs.client.block.write.retries to a value that causes the client to retry for
more than 10 minutes? In that case, your test case should succeed. The idea is
that if the client does not bail out but keeps retrying for more than 10
minutes, it is bound to succeed. Please let us know.
I will also look at your patch in greater detail.
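As a back-of-the-envelope check of that suggestion (both numbers are
assumptions: the ~10-minute dead-node timeout mentioned above, and a ~6-second
cost per failed attempt judging from the timestamps in the log below):
{noformat}
/**
 * Back-of-the-envelope check only. Assumes each failed attempt costs ~6
 * seconds (from the log timestamps below) and that the NN needs ~10 minutes
 * of missed heartbeats to declare a datanode dead.
 */
public class RetryBudget {
  public static void main(String[] args) {
    long nnDeadNodeTimeoutMs = 10L * 60 * 1000; // ~10 minutes of missed heartbeats
    long perAttemptMs = 6 * 1000;               // rough cost of one failed attempt
    long retriesNeeded = (nnDeadNodeTimeoutMs + perAttemptMs - 1) / perAttemptMs;
    // With these assumptions, roughly 100 retries would span the dead-node
    // window, versus the much smaller default for dfs.client.block.write.retries.
    System.out.println("dfs.client.block.write.retries >= " + retriesNeeded);
  }
}
{noformat}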
> File write fails after data node goes down
> ------------------------------------------
>
> Key: HADOOP-5713
> URL: https://issues.apache.org/jira/browse/HADOOP-5713
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Reporter: Alban Chevignard
> Attachments: failed_write.patch
>
>
> If a data node goes down while a file is being written to HDFS, the write
> fails with the following errors:
> {noformat}
> 09/04/20 17:15:39 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:39 INFO dfs.DFSClient: Abandoning block blk_-6792221430152215651_1003
> 09/04/20 17:15:45 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:45 INFO dfs.DFSClient: Abandoning block blk_-1056044503329698571_1003
> 09/04/20 17:15:51 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:51 INFO dfs.DFSClient: Abandoning block blk_-1144491637577072681_1003
> 09/04/20 17:15:57 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:57 INFO dfs.DFSClient: Abandoning block blk_6574618270268421892_1003
> 09/04/20 17:16:03 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2387)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1746)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1924)
> 09/04/20 17:16:03 WARN dfs.DFSClient: Error Recovery for block blk_6574618270268421892_1003 bad datanode[1]
> {noformat}
> The tests were done with the following configuration:
> * Hadoop version 0.18.3
> * 3 data nodes with replication count of 2
> * 1 GB file write
> * 1 data node taken down during write
> This issue seems to be caused by the delay between the time a data node goes
> down and the time it is marked as dead by the name node. This delay is
> unavoidable, but the name node should not keep allocating new blocks to data
> nodes that the client already knows to be down. Even after adjusting
> {{heartbeat.recheck.interval}}, there is still a window during which this
> issue can occur.
> One possible fix would be to allow clients to exclude known bad data nodes
> when allocating new blocks. See {{failed_write.patch}} for an example.
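> As a rough illustration (a hypothetical signature, not necessarily what
> {{failed_write.patch}} actually does), the client-side allocation call could
> accept the datanodes the client already knows to be unreachable:
> {noformat}
> // Illustrative only: hypothetical extension of the client-to-namenode block
> // allocation call, letting the client pass the datanodes it has already
> // failed to reach so the name node can avoid them until its own heartbeat
> // timeout catches up. Parameter types stand for the DFS client-protocol types.
> interface ClientBlockAllocation {
>   LocatedBlock addBlock(String src, String clientName,
>                         DatanodeInfo[] excludedNodes) throws IOException;
> }
> {noformat}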