[
https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617004#action_12617004
]
Raghu Angadi commented on HADOOP-3831:
--------------------------------------
Before 0.17, the DataNode did not close the client socket on its own. So when the
DFSClient detects any socket error from a datanode, it marks that node as dead, and
the list of 'deadnodes' is maintained for each open file. This list persists for the
life of the open file, so the dead node will not be contacted even for other blocks.
This policy does not work well with 0.17 when there are slow clients.
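To illustrate the current behavior (this is only a rough sketch, not the actual
DFSClient code; the class and member names below are made up), the per-file
bookkeeping is roughly:

{code:java}
import java.net.InetSocketAddress;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the per-open-file 'deadnodes' bookkeeping.
class OpenFileReader {
  // One set per open file; it is never cleared, so a datanode that failed once
  // is skipped for every later block read through this file handle.
  private final Set<InetSocketAddress> deadNodes = new HashSet<InetSocketAddress>();

  // Pick the first replica location that has not been marked dead.
  InetSocketAddress chooseDataNode(InetSocketAddress[] locations) {
    for (InetSocketAddress dn : locations) {
      if (!deadNodes.contains(dn)) {
        return dn;
      }
    }
    return null; // leads to "Could not obtain block ... from any node"
  }

  // Any socket error lands the node here permanently for this file.
  void markDead(InetSocketAddress dn) {
    deadNodes.add(dn);
  }
}
{code}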
Apart from slow clients, where the DataNode closes the connection after an 8 minute
write timeout, HADOOP-3633 in 0.17.2 introduced another (I think more likely)
case where the client sees errors from a datanode: when the datanode already has
256 transfers going on.
A couple of fairly simple fixes:
* 1. When the client detects a connection failure after it has already read some
bytes from a DataNode, it should retry once with the same datanode before moving
on to the next one (a sketch follows below).
*- This will fix the write-timeout problem reported in this jira.
*- Checksum errors will still be handled the same way as before.
* 2. Clear the 'deadnodes' list when the client moves to a new block.
*- This will reduce the effect of HADOOP-3633 when there are a lot of clients.
*- The larger issue remains: when is a datanode really dead, and when should it
be retried after transient errors?
This jira mainly requires (1). We could postpone (2) until we get more
experience with HADOOP-3633. Thoughts?
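To make (1) concrete, here is a minimal sketch of the retry. This is not a patch
against DFSClient; the BlockReader interface and all names below are hypothetical,
and the real change would live inside the client's block reading path.

{code:java}
import java.io.IOException;

// Hypothetical helper illustrating fix (1): retry the same datanode once when a
// failure happens after some bytes were already read from it.
class BlockReadHelper {

  interface BlockReader {
    // Reads into buf starting at the given offset within the block;
    // throws IOException on socket errors (e.g. "Premeture EOF").
    int read(String datanode, long blockId, long offset, byte[] buf) throws IOException;
  }

  static int readWithRetry(BlockReader reader, String datanode, long blockId,
                           long offset, byte[] buf) throws IOException {
    try {
      return reader.read(datanode, blockId, offset, buf);
    } catch (IOException first) {
      if (offset > 0) {
        // We already read data from this node, so the failure is more likely a
        // datanode-side write timeout than a bad node: retry the same node once
        // from the current offset instead of marking it dead.
        return reader.read(datanode, blockId, offset, buf);
      }
      // Nothing was read yet: fall back to the existing behavior.
      throw first;
    }
  }
}
{code}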
> slow-reading dfs clients do not recover from datanode-write-timeouts
> --------------------------------------------------------------------
>
> Key: HADOOP-3831
> URL: https://issues.apache.org/jira/browse/HADOOP-3831
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.17.1
> Reporter: Christian Kunz
> Assignee: Raghu Angadi
>
> Some of our applications read through certain files from dfs (using libhdfs)
> much slower than through others, such that they trigger the write timeout
> introduced in 0.17.x into the datanodes. Eventually they fail.
> Dfs clients should be able to recover from such a situation.
> In the meantime, would setting
> dfs.datanode.socket.write.timeout=0
> in hadoop-site.xml help?
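For reference, such an entry in hadoop-site.xml would look like the following
(this is a workaround only; a value of 0 is expected to disable the write timeout
rather than fix the underlying client behavior):

{code:xml}
<!-- hadoop-site.xml: disable the datanode socket write timeout (workaround) -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
{code}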
> Here are the exceptions I see:
> DataNode:
> 2008-07-24 00:12:40,167 WARN org.apache.hadoop.dfs.DataNode: xxx:50010:Got
> exception while serving blk_3304550638094049753 to /yyy:
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/xxx:50010 remote=/yyy:42542]
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:170)
> at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:144)
> at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:105)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> at java.io.DataOutputStream.write(DataOutputStream.java:90)
> at
> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1774)
> at
> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1813)
> at
> org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1039)
> at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:968)
> at java.lang.Thread.run(Thread.java:619)
> DFS Client:
> 08/07/24 00:13:28 WARN dfs.DFSClient: Exception while reading from
> blk_3304550638094049753 of zzz from xxx:50010: java.io.IOException: Premeture
> EOF from inputStream
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
> at
> org.apache.hadoop.dfs.DFSClient$BlockReader.readChunk(DFSClient.java:967)
> at
> org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:236)
> at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:191)
> at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
> at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:829)
> at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1352)
> at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1388)
> at java.io.DataInputStream.read(DataInputStream.java:83)
> 08/07/24 00:13:28 INFO dfs.DFSClient: Could not obtain block
> blk_3304550638094049753 from any node: java.io.IOException: No live nodes
> contain current block