[ https://issues.apache.org/jira/browse/HADOOP-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716160#action_12716160 ]

dhruba borthakur commented on HADOOP-5713:
------------------------------------------

@Todd: if you have a reproducible test case, can you please set 
dfs.client.block.write.retries high enough that the client keeps retrying for 
more than 10 minutes (i.e. past the window the name node needs to mark the 
data node dead), and then rerun the test case to see if it encounters the 
same problem.
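
For reference, a minimal sketch of how that property could be set in 
hadoop-site.xml on 0.18.x (the default is 3; the value 120 below is only 
illustrative, picked so the retries span well past the dead-node detection 
window):
{noformat}
<!-- hadoop-site.xml: raise the per-block write retry count so the client
     keeps retrying longer than the name node needs to mark the node dead. -->
<property>
  <name>dfs.client.block.write.retries</name>
  <value>120</value>
</property>
{noformat}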

> File write fails after data node goes down
> ------------------------------------------
>
>                 Key: HADOOP-5713
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5713
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Alban Chevignard
>         Attachments: failed_write.patch
>
>
> If a data node goes down while a file is being written to HDFS, the write 
> fails with the following errors:
> {noformat} 
> 09/04/20 17:15:39 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:39 INFO dfs.DFSClient: Abandoning block blk_-6792221430152215651_1003
> 09/04/20 17:15:45 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:45 INFO dfs.DFSClient: Abandoning block blk_-1056044503329698571_1003
> 09/04/20 17:15:51 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:51 INFO dfs.DFSClient: Abandoning block blk_-1144491637577072681_1003
> 09/04/20 17:15:57 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
> 09/04/20 17:15:57 INFO dfs.DFSClient: Abandoning block blk_6574618270268421892_1003
> 09/04/20 17:16:03 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>       at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2387)
>       at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1746)
>       at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1924)
> 09/04/20 17:16:03 WARN dfs.DFSClient: Error Recovery for block blk_6574618270268421892_1003 bad datanode[1]
> {noformat} 
> The tests were done with the following configuration:
> * Hadoop version 0.18.3
> * 3 data nodes with replication count of 2
> * 1 GB file write
> * 1 data node taken down during write
> This issue seems to be caused by the delay between the time a data node goes 
> down and the time the name node marks it as dead. That delay is unavoidable, 
> but the name node should not keep allocating new blocks to data nodes that 
> the client already knows are down. Even after adjusting 
> {{heartbeat.recheck.interval}}, there is still a window during which this 
> issue can occur.
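> To see how wide the default window is: the name node only declares a data 
> node dead after 2 * {{heartbeat.recheck.interval}} + 10 * 
> {{dfs.heartbeat.interval}}. A quick check with the stock defaults (a sketch 
> mirroring the name node's expiry formula, not code from the patch):
> {noformat}
> public class DeadNodeWindow {
>     public static void main(String[] args) {
>         // Name node expiry check: 2 * recheck interval + 10 * heartbeat interval.
>         long heartbeatMs = 3L * 1000;       // dfs.heartbeat.interval default: 3 s
>         long recheckMs   = 5L * 60 * 1000;  // heartbeat.recheck.interval default: 5 min
>         long expiryMs = 2 * recheckMs + 10 * heartbeatMs;
>         System.out.println(expiryMs / 60000.0 + " minutes"); // prints 10.5 minutes
>     }
> }
> {noformat}
> So with the defaults, the client can keep being handed a dead target for 
> over ten minutes.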
> One possible fix would be to allow clients to exclude known bad data nodes 
> when allocating new blocks. See {{failed_write.patch}} for an example.
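> To make the idea concrete, here is a rough, hypothetical client-side sketch 
> (names are illustrative; the actual change is in {{failed_write.patch}}): 
> the client remembers data nodes that failed pipeline setup and rejects any 
> new pipeline containing them.
> {noformat}
> import java.util.HashSet;
> import java.util.Set;
> 
> // Hypothetical sketch: track data nodes that failed pipeline setup so the
> // client can avoid them when the name node allocates a new block.
> public class FailedNodeTracker {
>     private final Set<String> failedNodes = new HashSet<String>();
> 
>     // Record the firstBadLink reported by createBlockOutputStream.
>     public void markFailed(String datanode) {
>         failedNodes.add(datanode);
>     }
> 
>     // True if none of the proposed pipeline targets is known to be bad;
>     // otherwise the client would abandon the block and request another,
>     // passing the failed set so the name node can exclude those nodes.
>     public boolean pipelineIsUsable(String[] targets) {
>         for (String dn : targets) {
>             if (failedNodes.contains(dn)) {
>                 return false;
>             }
>         }
>         return true;
>     }
> }
> {noformat}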

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
