[ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630047#action_12630047 ]

Stefan Will commented on HADOOP-3831:
-------------------------------------

I'm not sure whether this is the same issue or not, but on my 4-slave
cluster, setting the parameter below doesn't seem to fix the issue.
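
For reference, here's how I have it set (just the standard hadoop-site.xml
property block; consider this a sketch of my config rather than a verified
snippet):

    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <!-- 0 is supposed to disable the datanode's socket write timeout -->
      <value>0</value>
    </property>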

What I'm seeing is that occasionally data nodes stop responding for up to 10
minutes at a time. When that happens, the TaskTrackers will mark the nodes
as dead, and occasionally the namenode will mark them as dead as well (you
can see the "Last Contact" time steadily increase for a random node every
half hour or so).

This seems to be happening during times of high disk utilization.

Two more things I noticed that happen when the datanodes become unresponsive
(i.e., the "Last Contact" field on the namenode keeps increasing):

1. The datanode process seems to be completely hung for a while, including
its Jetty web interface, sometimes for over 10 minutes.

2. The task tracker on the same machine keeps humming along, sending regular
heartbeats.

To me this looks like there is some sort of temporary deadlock in the
datanode that keeps it from responding to requests. Perhaps it's the block
report being generated?
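
If it happens again, I'll try to grab a thread dump while the datanode is
hung to confirm. Nothing Hadoop-specific, just standard JDK tooling (the
pid below is a placeholder):

    # Capture the datanode's thread stacks while it is unresponsive
    # (<datanode-pid> stands in for the actual process id).
    jstack <datanode-pid> > datanode-threads.txt

    # Alternatively, SIGQUIT makes the JVM dump all threads to stdout:
    kill -QUIT <datanode-pid>

If the dump shows the DataXceiver threads all blocked waiting on the same
lock while a block report is being generated, that would explain the hang.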

> slow-reading dfs clients do not recover from datanode-write-timeouts
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3831
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3831
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.1
>            Reporter: Christian Kunz
>            Assignee: Raghu Angadi
>         Attachments: HADOOP-3831.patch, HADOOP-3831.patch, HADOOP-3831.patch, HADOOP-3831.patch
>
>
> Some of our applications read through certain files from dfs (using libhdfs)
> much more slowly than through others, such that they trigger the write
> timeout introduced into the datanodes in 0.17.x. Eventually they fail.
> Dfs clients should be able to recover from such a situation.
> In the meantime, would setting
> dfs.datanode.socket.write.timeout=0
> in hadoop-site.xml help?
> Here are the exceptions I see:
> DataNode:
> 2008-07-24 00:12:40,167 WARN org.apache.hadoop.dfs.DataNode: xxx:50010:Got exception while serving blk_3304550638094049753 to /yyy:
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/xxx:50010 remote=/yyy:42542]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:170)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:144)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:105)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1774)
>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1813)
>         at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1039)
>         at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:968)
>         at java.lang.Thread.run(Thread.java:619)
> DFS Client:
> 08/07/24 00:13:28 WARN dfs.DFSClient: Exception while reading from blk_3304550638094049753 of zzz from xxx:50010: java.io.IOException: Premeture EOF from inputStream
>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>     at org.apache.hadoop.dfs.DFSClient$BlockReader.readChunk(DFSClient.java:967)
>     at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:236)
>     at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:191)
>     at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
>     at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:829)
>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1352)
>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1388)
>     at java.io.DataInputStream.read(DataInputStream.java:83)
> 08/07/24 00:13:28 INFO dfs.DFSClient: Could not obtain block blk_3304550638094049753 from any node:  java.io.IOException: No live nodes contain current block

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
