[ https://issues.apache.org/jira/browse/HADOOP-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716160#action_12716160 ]
dhruba borthakur commented on HADOOP-5713:
------------------------------------------

@Todd: if you have a reproducible test case, can you please set dfs.client.block.write.retries high enough that the client keeps retrying for more than 10 minutes, and then rerun the test case to see if it still encounters the same problem.

> File write fails after data node goes down
> ------------------------------------------
>
>                 Key: HADOOP-5713
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5713
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Alban Chevignard
>         Attachments: failed_write.patch
>
>
> If a data node goes down while a file is being written to HDFS, the write fails with the following errors:
> {noformat}
> 09/04/20 17:15:39 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
> Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:39 INFO dfs.DFSClient: Abandoning block blk_-6792221430152215651_1003
> 09/04/20 17:15:45 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
> Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:45 INFO dfs.DFSClient: Abandoning block blk_-1056044503329698571_1003
> 09/04/20 17:15:51 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
> Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:51 INFO dfs.DFSClient: Abandoning block blk_-1144491637577072681_1003
> 09/04/20 17:15:57 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
> Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:57 INFO dfs.DFSClient: Abandoning block blk_6574618270268421892_1003
> 09/04/20 17:16:03 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException:
> Unable to create new block.
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2387)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1746)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1924)
> 09/04/20 17:16:03 WARN dfs.DFSClient: Error Recovery for block blk_6574618270268421892_1003 bad datanode[1]
> {noformat}
> The tests were done with the following configuration:
> * Hadoop version 0.18.3
> * 3 data nodes with a replication count of 2
> * 1 GB file write
> * 1 data node taken down during the write
> This issue seems to be caused by the delay between the time a data node goes down and the time it is marked as dead by the name node. This delay is unavoidable, but the name node should not keep allocating new blocks to data nodes that the client already knows are down. Even by adjusting {{heartbeat.recheck.interval}}, there is still a window during which this issue can occur.
> One possible fix would be to allow clients to exclude known bad data nodes when allocating new blocks. See {{failed_write.patch}} for an example.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
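
A note on the suggestion above: {{dfs.client.block.write.retries}} is a retry count (I believe the default in this era is 3), not a duration, so the idea is to raise it far enough that the client's block-allocation retries span the window before the name node marks the failed data node as dead. Below is a minimal sketch of how a reproduction test might set the property programmatically; the value 20, the output path, and the write size are illustrative assumptions, not recommendations.

{noformat}
// Minimal sketch, not a definitive reproduction: raises the client-side block
// write retry count before writing a large file. The value 20 and the path
// /tmp/failed_write_test are illustrative assumptions only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetryWriteTest {
  public static void main(String[] args) throws Exception {
    // Picks up hadoop-site.xml for the test cluster from the classpath.
    Configuration conf = new Configuration();

    // Retry block allocation enough times that the retries outlast the window
    // during which the name node still considers the failed data node alive.
    conf.setInt("dfs.client.block.write.retries", 20);

    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/failed_write_test"));
    byte[] buf = new byte[64 * 1024];
    for (int i = 0; i < 16 * 1024; i++) {   // ~1 GB total, matching the reported test
      out.write(buf);                        // take a data node down while this loop runs
    }
    out.close();
    fs.close();
  }
}
{noformat}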
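On the proposed fix itself, {{failed_write.patch}} is the authoritative change; the sketch below only illustrates the client-side idea of remembering data nodes that returned bad connect acks so they can be excluded from the next block allocation request. The class and method names here are hypothetical and do not reflect the patch or the actual DFSClient internals.

{noformat}
// Illustration only: tracks data nodes the client has already found unreachable,
// so a subsequent "allocate new block" request could ask the name node to skip
// them. The names ExcludedNodeTracker, markBad, and getExcludedNodes are
// hypothetical and are not taken from failed_write.patch or DFSClient.
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ExcludedNodeTracker {
  private final Set<String> excluded = new HashSet<String>();

  // Record a data node (host:port) after a bad connect ack in createBlockOutputStream.
  public synchronized void markBad(String datanodeAddress) {
    excluded.add(datanodeAddress);
  }

  // The client would send this set along with its next block allocation request,
  // letting the name node avoid nodes the client already knows are down, even
  // though the heartbeat check has not yet marked them dead.
  public synchronized Set<String> getExcludedNodes() {
    return Collections.unmodifiableSet(new HashSet<String>(excluded));
  }
}
{noformat}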