[ https://issues.apache.org/jira/browse/HADOOP-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alban Chevignard updated HADOOP-5713:
-------------------------------------

    Description: 
If a data node goes down while a file is being written to HDFS, the write fails 
with the following errors:
{noformat} 
09/04/20 17:15:39 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
09/04/20 17:15:39 INFO dfs.DFSClient: Abandoning block blk_-6792221430152215651_1003
09/04/20 17:15:45 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
09/04/20 17:15:45 INFO dfs.DFSClient: Abandoning block blk_-1056044503329698571_1003
09/04/20 17:15:51 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
09/04/20 17:15:51 INFO dfs.DFSClient: Abandoning block blk_-1144491637577072681_1003
09/04/20 17:15:57 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.0.66:50010
09/04/20 17:15:57 INFO dfs.DFSClient: Abandoning block blk_6574618270268421892_1003
09/04/20 17:16:03 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2387)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1746)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1924)
09/04/20 17:16:03 WARN dfs.DFSClient: Error Recovery for block blk_6574618270268421892_1003 bad datanode[1]
{noformat} 

The tests were done with the following configuration:
* Hadoop version 0.18.3
* 3 data nodes with replication count of 2
* 1 GB file write
* 1 data node taken down during write

This issue seems to be caused by the delay between the time a data node goes 
down and the time the name node marks it as dead. That delay is unavoidable, 
but the name node should not keep allocating new blocks on data nodes that the 
client already knows to be down. Even after tuning 
{{heartbeat.recheck.interval}}, there is still a window during which this issue 
can occur, as the sketch below illustrates.
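
To illustrate the window, here is a simplified sketch of the client-side block 
allocation loop as it behaves in 0.18 (method names abridged, not verbatim 
{{DFSClient}} source; it relies on the existing {{createBlockOutputStream}} 
helper). Every retry asks the name node for a brand-new block, but nothing 
tells the name node which target just failed, so the same dead data node can be 
handed back repeatedly until the heartbeat recheck finally marks it dead:
{noformat}
// Simplified sketch of the 0.18-era client retry loop, not verbatim source.
private DatanodeInfo[] allocateBlockPipeline(ClientProtocol namenode,
                                             String src,
                                             String clientName) throws IOException {
  int retries = 3;
  while (retries-- > 0) {
    // The name node picks targets with no knowledge of the client's failures,
    // so the node that just refused a connection may be chosen again.
    LocatedBlock lb = namenode.addBlock(src, clientName);
    DatanodeInfo[] targets = lb.getLocations();
    if (createBlockOutputStream(targets)) {   // "Bad connect ack" surfaces here
      return targets;                         // pipeline established
    }
    // Give the block back and retry with a fresh allocation.
    namenode.abandonBlock(lb.getBlock(), src, clientName);
  }
  throw new IOException("Unable to create new block.");
}
{noformat}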

One possible fix would be to allow clients to exclude known bad data nodes when 
allocating new blocks. See {{failed_write.patch}} for an example.
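
As a rough illustration of that approach (a sketch of the idea only, not the 
actual contents of {{failed_write.patch}}): the client accumulates the data 
nodes it failed to connect to and passes them back to the name node when 
requesting the next block, so they can be skipped immediately instead of 
waiting for the heartbeat timeout. The three-argument {{addBlock}} overload and 
the {{connectToPipeline}} helper below are hypothetical names, not existing 
0.18 APIs:
{noformat}
// Hypothetical sketch only -- not the actual failed_write.patch.
//
// Hypothetical protocol extension (the two-argument addBlock exists in 0.18;
// the excludedNodes parameter is the proposed addition):
//   LocatedBlock addBlock(String src, String clientName,
//                         DatanodeInfo[] excludedNodes) throws IOException;

List<DatanodeInfo> excluded = new ArrayList<DatanodeInfo>();
while (true) {
  LocatedBlock lb = namenode.addBlock(src, clientName,
      excluded.toArray(new DatanodeInfo[excluded.size()]));
  DatanodeInfo badNode = connectToPipeline(lb);   // hypothetical helper: returns
                                                  // the first unreachable target,
                                                  // or null on success
  if (badNode == null) {
    break;                                        // pipeline established
  }
  excluded.add(badNode);                          // remember the bad node and retry
  namenode.abandonBlock(lb.getBlock(), src, clientName);
}
{noformat}
This keeps the name node's normal placement policy intact; it only removes the 
nodes the client has direct evidence against from the candidate set for that 
one allocation.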

> File write fails after data node goes down
> ------------------------------------------
>
>                 Key: HADOOP-5713
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5713
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Alban Chevignard
>         Attachments: failed_write.patch
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
