[
https://issues.apache.org/jira/browse/HADOOP-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716161#action_12716161
]
Todd Lipcon commented on HADOOP-5713:
-------------------------------------
I don't currently have a reproducible test case - the failure injection might be
slightly tough. I'll see if I can cook something up, though; that is definitely
the first step.
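For reference, the rough shape I have in mind is a MiniDFSCluster-based test that
kills one datanode partway through a multi-block write. The names below (the
MiniDFSCluster constructor, stopDataNode) are from the 0.18-era test utilities as
I remember them, so treat this as a sketch of the approach rather than a finished
test:
{noformat}
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.dfs.MiniDFSCluster;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TestWriteAfterDatanodeFailure {

  public void testWriteSurvivesDatanodeFailure() throws Exception {
    Configuration conf = new Configuration();
    // Mirror the reporter's setup: replication factor 2 on a 3-datanode cluster.
    conf.setInt("dfs.replication", 2);
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 3, true, null);
    try {
      FileSystem fs = cluster.getFileSystem();
      OutputStream out = fs.create(new Path("/failed_write_test"));
      byte[] chunk = new byte[1 << 20]; // 1 MB per write

      // Write enough data to span several blocks, and stop one datanode
      // halfway through (assumes the mini cluster exposes something like
      // stopDataNode(int) for single-node failure injection).
      for (int i = 0; i < 256; i++) {
        out.write(chunk);
        if (i == 128) {
          cluster.stopDataNode(0);
        }
      }
      // Expected: close() succeeds using the surviving datanodes.
      // Observed per this issue: createBlockOutputStream keeps targeting the
      // dead node until the write fails with "Unable to create new block".
      out.close();
    } finally {
      cluster.shutdown();
    }
  }
}
{noformat}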
> File write fails after data node goes down
> ------------------------------------------
>
> Key: HADOOP-5713
> URL: https://issues.apache.org/jira/browse/HADOOP-5713
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Reporter: Alban Chevignard
> Attachments: failed_write.patch
>
>
> If a data node goes down while a file is being written to HDFS, the write
> fails with the following errors:
> {noformat}
> 09/04/20 17:15:39 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:39 INFO dfs.DFSClient: Abandoning block blk_-6792221430152215651_1003
> 09/04/20 17:15:45 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:45 INFO dfs.DFSClient: Abandoning block blk_-1056044503329698571_1003
> 09/04/20 17:15:51 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:51 INFO dfs.DFSClient: Abandoning block blk_-1144491637577072681_1003
> 09/04/20 17:15:57 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:57 INFO dfs.DFSClient: Abandoning block blk_6574618270268421892_1003
> 09/04/20 17:16:03 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2387)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1746)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1924)
> 09/04/20 17:16:03 WARN dfs.DFSClient: Error Recovery for block blk_6574618270268421892_1003 bad datanode[1]
> {noformat}
> The tests were done with the following configuration:
> * Hadoop version 0.18.3
> * 3 data nodes with replication count of 2
> * 1 GB file write
> * 1 data node taken down during write
> This issue seems to be caused by the delay between the time a data node goes
> down and the time it is marked as dead by the name node. This delay is
> unavoidable, but the name node should not keep allocating new blocks to data
> nodes that the client already knows to be down. Even after adjusting
> {{heartbeat.recheck.interval}}, there is still a window during which this
> issue can occur.
> One possible fix would be to allow clients to exclude known bad data nodes
> when allocating new blocks. See {{failed_write.patch}} for an example.
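> To make the idea concrete, here is a toy sketch of the allocation rule being
> proposed (names and addresses below are illustrative only; the actual change
> is in {{failed_write.patch}}): the client reports the data nodes it has seen
> fail, and the name node skips them when choosing targets for the next block.
> {noformat}
> import java.util.ArrayList;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Set;
>
> public class ExcludeBadDatanodesSketch {
>
>   // Pick up to 'replication' targets from the nodes the name node still
>   // considers live, skipping any node the client has already seen fail.
>   static List<String> chooseTargets(List<String> liveNodes,
>                                     Set<String> clientExcluded,
>                                     int replication) {
>     List<String> targets = new ArrayList<String>();
>     for (String node : liveNodes) {
>       if (targets.size() == replication) {
>         break;
>       }
>       if (!clientExcluded.contains(node)) {
>         targets.add(node);
>       }
>     }
>     return targets;
>   }
>
>   public static void main(String[] args) {
>     // The name node still thinks all three data nodes are alive, but the
>     // client has already seen 192.168.0.66:50010 fail its connect ack.
>     // (The other two addresses are made up for this example.)
>     List<String> liveNodes = new ArrayList<String>();
>     liveNodes.add("192.168.0.66:50010");
>     liveNodes.add("192.168.0.67:50010");
>     liveNodes.add("192.168.0.68:50010");
>
>     Set<String> excluded = new HashSet<String>();
>     excluded.add("192.168.0.66:50010");
>
>     // Prints the two healthy nodes instead of retrying the dead one.
>     System.out.println(chooseTargets(liveNodes, excluded, 2));
>   }
> }
> {noformat}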
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.