[ 
https://issues.apache.org/jira/browse/HADOOP-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur updated HADOOP-2883:
-------------------------------------

    Attachment: packetResponse.patch

This patch fixes three problems:

1. The datanode used to ack a packet before its content was flushed from the 
buffered output stream for the block file. This means that if the flush fails, 
then data could get corrupted. This patch flushes the block file and the 
metadata file before sending a positive ack to the client. I have verified that 
this does degrade performance of dfsiotest and randomwriter.
2. The original timeout value of 1 minute * length-of-pipeline has been 
restored. This change reduces the number of socket timeouts when a datanode is 
heavily loaded.
3. The Datanode verifies that a packet replay does not create holes in the 
block file (sparse files). The offset-in-block of every packet should be less 
than or equal to the size of the current block file.
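
The three changes above can be illustrated with a minimal Java sketch. This is 
not the code in packetResponse.patch: the class name PacketAckSketch, the 
method names and the BASE_TIMEOUT_MS constant are all hypothetical, and the 
sketch only assumes the behaviour stated in the list (flush before ack, a 
timeout of 1 minute * length-of-pipeline, and no packet offset beyond the 
current size of the block file).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Illustrative only; none of these names come from the Hadoop source. */
class PacketAckSketch {

    /** Assumed base timeout of 1 minute, per item 2 of the list. */
    static final int BASE_TIMEOUT_MS = 60_000;

    /** Item 2: the timeout scales with the number of datanodes in the pipeline. */
    static int pipelineTimeoutMs(int pipelineLength) {
        return BASE_TIMEOUT_MS * pipelineLength;
    }

    /**
     * Item 3: a packet would leave a hole (sparse region) if it starts
     * beyond the bytes already present in the block file.
     */
    static boolean createsHole(long offsetInBlock, long currentBlockFileLength) {
        return offsetInBlock > currentBlockFileLength;
    }

    /**
     * Item 1: flush the block stream and the metadata stream first, and
     * report success (a positive ack) only if both flushes succeed.
     */
    static boolean flushThenAck(OutputStream blockOut, OutputStream metaOut) {
        try {
            blockOut.flush();
            metaOut.flush();
            return true;   // safe to send a positive ack
        } catch (IOException e) {
            return false;  // flush failed, so do not ack the packet
        }
    }

    public static void main(String[] args) {
        System.out.println("timeout for 3-node pipeline (ms): " + pipelineTimeoutMs(3));
        System.out.println("offset 2048, file length 1024, hole: " + createsHole(2048, 1024));
        System.out.println("ack after flush: "
            + flushThenAck(new ByteArrayOutputStream(), new ByteArrayOutputStream()));
    }
}
```

The point of flushThenAck in this sketch is ordering: the ack result is 
computed only after both flushes return, so a failed flush can never be 
followed by a positive ack.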

> Extensive write failures
> ------------------------
>
>                 Key: HADOOP-2883
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2883
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.0
>            Reporter: Christian Kunz
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.16.1
>
>         Attachments: packetResponse.patch
>
>
> With the new release 0.16.0 we experience extensive write failures under 
> heavy load.
> The job shuffles 300TB on 1400 nodes and runs 3 waves of 2500 reducers. Each 
> reducer uses libhdfs to write to around 70 dfs files simultaneously. We did 
> not experience particular write problems up to nightly build #835 on hadoopqa 
> (Jan 28),
> but now with released 0.16.0 (candidate 2) we see a lot of exceptions related 
> to 'all datanodes are bad':
> typical exception(s):
> 08/02/22 10:34:47 WARN fs.DFSClient: Error Recovery for block 
> blk_434406883423887779 in pipeline xxx.xxx.xxx.146:50010, 
> xxx.xxx.xxx.224:50010: bad datanode xxx.xxx.xxx.146:50010
> 08/02/22 10:34:51 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:34:51 WARN fs.DFSClient: Error Recovery for block 
> blk_-1957866292089920792 in pipeline xxx.xxx.xxx.147:50010, 
> xxx.xxx.xxx.10:50010: bad datanode xxx.xxx.xxx.147:50010
> 08/02/22 10:34:54 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:34:54 WARN fs.DFSClient: Error Recovery for block 
> blk_-5265240773298481019 in pipeline xxx.xxx.xxx.152:50010, 
> xxx.xxx.xxx.71:50010: bad datanode xxx.xxx.xxx.152:50010
> 08/02/22 10:34:54 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:34:54 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed outxxx.xxx.xxx.166:50010
> 08/02/22 10:34:55 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block 
> blk_8456718220685890569 in pipeline xxx.xxx.xxx.158:50010, 
> xxx.xxx.xxx.225:50010: bad datanode xxx.xxx.xxx.158:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block 
> blk_1420965154382429572 in pipeline xxx.xxx.xxx.169:50010, 
> xxx.xxx.xxx.221:50010: bad datanode xxx.xxx.xxx.169:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block 
> blk_-519424763987472708 in pipeline xxx.xxx.xxx.154:50010, 
> xxx.xxx.xxx.37:50010: bad datanode xxx.xxx.xxx.154:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block 
> blk_-8376556524788296783 in pipeline xxx.xxx.xxx.154:50010, 
> xxx.xxx.xxx.212:50010: bad datanode xxx.xxx.xxx.154:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block 
> blk_-2429564741658530079 in pipeline xxx.xxx.xxx.160:50010, 
> xxx.xxx.xxx.105:50010: bad datanode xxx.xxx.xxx.160:50010
> 08/02/22 10:35:00 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:00 WARN fs.DFSClient: Error Recovery for block 
> blk_-6653210787685458124 in pipeline xxx.xxx.xxx.143:50010, 
> xxx.xxx.xxx.37:50010: bad datanode xxx.xxx.xxx.143:50010
> 08/02/22 10:35:01 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:01 WARN fs.DFSClient: Error Recovery for block 
> blk_7515160028005424426 in pipeline xxx.xxx.xxx.167:50010, 
> xxx.xxx.xxx.152:50010: bad datanode xxx.xxx.xxx.167:50010
> 08/02/22 10:35:03 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:03 WARN fs.DFSClient: Error Recovery for block 
> blk_-7191475898558388503 in pipeline xxx.xxx.xxx.139:50010, 
> xxx.xxx.xxx.6:50010: bad datanode xxx.xxx.xxx.139:50010
> 08/02/22 10:35:03 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:03 WARN fs.DFSClient: Error Recovery for block 
> blk_-340745015348833165 in pipeline xxx.xxx.xxx.141:50010, 
> xxx.xxx.xxx.153:50010: bad datanode xxx.xxx.xxx.141:50010
> 08/02/22 10:35:04 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:04 WARN fs.DFSClient: Error Recovery for block 
> blk_-6861254790596076341 in pipeline xxx.xxx.xxx.157:50010, 
> xxx.xxx.xxx.224:50010: bad datanode xxx.xxx.xxx.157:50010
> 08/02/22 10:35:14 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:14 INFO fs.DFSClient: Abandoning block blk_6188945400680100475
> 08/02/22 10:35:14 INFO fs.DFSClient: Waiting to find target node: 
> xxx.xxx.xxx.161:50010
> 08/02/22 10:35:43 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:47 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:48 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:49 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:49 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:50 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:50 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:50 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:53 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:54 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:57 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:35:57 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:03 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:04 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:06 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:06 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> 08/02/22 10:36:07 INFO fs.DFSClient: Exception in createBlockOutputStream 
> java.net.SocketTimeoutException: Read timed out
> Exception in thread "main" java.io.IOException: All datanodes 
> xxx.xxx.xxx.83:50010 are bad. Aborting...
>       at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1839)
>       at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
>       at 
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)
> Call to org.apache.hadoop.fs.FSDataOutputStream::write failed!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
