I'm not sure whether this is the same issue or not, but on my 4-slave
cluster, setting the parameter below (dfs.datanode.socket.write.timeout=0)
doesn't seem to fix the issue.
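
For reference, this is roughly how that setting goes into hadoop-site.xml (a
minimal sketch; 0 disables the datanode's socket write timeout, and the
480000 millis in the SocketTimeoutException quoted further down appears to
correspond to the default 8-minute value of this timeout):

  <!-- hadoop-site.xml: disable the datanode socket write timeout
       (workaround discussed in HADOOP-3831); 0 means no timeout -->
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>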

What I'm seeing is that occasionally data nodes stop responding for up to 10
minutes at a time. When that happens, the TaskTrackers mark the nodes as
dead, and occasionally the namenode marks them as dead as well (you can
see the "Last Contact" time steadily increase for a random node every half
hour or so).

This seems to be happening during times of high disk utilization.

-- Stefan



> From: Espen Amble Kolstad <[EMAIL PROTECTED]>
> Reply-To: <[email protected]>
> Date: Mon, 8 Sep 2008 12:40:01 +0200
> To: <[email protected]>
> Subject: Re: Could not obtain block: blk_-2634319951074439134_1129
> file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
> 
> There's a JIRA on this already:
> https://issues.apache.org/jira/browse/HADOOP-3831
> Setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml seems
> to do the trick for now.
> 
> Espen
> 
> On Mon, Sep 8, 2008 at 11:24 AM, Espen Amble Kolstad <[EMAIL PROTECTED]> 
> wrote:
>> Hi,
>> 
>> Thanks for the tip!
>> 
>> I tried revision 692572 of the 0.18 branch, but I still get the same errors.
>> 
>> On Sunday 07 September 2008 09:42:43 Dhruba Borthakur wrote:
>>> The DFS errors might have been caused by
>>> 
>>> http://issues.apache.org/jira/browse/HADOOP-4040
>>> 
>>> thanks,
>>> dhruba
>>> 
>>> On Sat, Sep 6, 2008 at 6:59 AM, Devaraj Das <[EMAIL PROTECTED]> wrote:
>>>> These exceptions are apparently coming from the dfs side of things. Could
>>>> someone from the dfs side please look at these?
>>>> 
>>>> On 9/5/08 3:04 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
>>>>> Hi,
>>>>> 
>>>>> Thanks!
>>>>> The patch applies without change to hadoop-0.18.0, and should be
>>>>> included in a 0.18.1.
>>>>> 
>>>>> However, I'm still seeing:
>>>>> in hadoop.log:
>>>>> 2008-09-05 11:13:54,805 WARN  dfs.DFSClient - Exception while reading
>>>>> from blk_3428404120239503595_2664 of
>>>>> /user/trank/segments/20080905102650/crawl_generate/part-00010 from
>>>>> somehost:50010: java.io.IOException: Premeture EOF from inputStream
>>>>> 
>>>>> in datanode.log:
>>>>> 2008-09-05 11:15:09,554 WARN  dfs.DataNode -
>>>>> DatanodeRegistration(somehost:50010,
>>>>> storageID=DS-751763840-somehost-50010-1219931304453, infoPort=50075,
>>>>> ipcPort=50020):Got exception while serving
>>>>> blk_-4682098638573619471_2662 to
>>>>> /somehost:
>>>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
>>>>> for channel to be ready for write. ch :
>>>>> java.nio.channels.SocketChannel[connected local=/somehost:50010
>>>>> remote=/somehost:45244]
>>>>> 
>>>>> These entries in datanode.log happen repeatedly, a few minutes apart.
>>>>> I've reduced the number of map-tasks so the load on this node stays below
>>>>> 1.0 with 5GB of free memory (so it's not resource starvation).
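
As a minimal illustrative sketch (not necessarily how it was done here), the
number of concurrent map tasks per node is usually capped in hadoop-site.xml
on the worker nodes like this; the value of 1 is an assumption for
illustration, the default being 2:

  <property>
    <!-- limits concurrent map tasks per TaskTracker -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>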
>>>>> 
>>>>> Espen
>>>>> 
>>>>> On Thu, Sep 4, 2008 at 3:33 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
>>>>>>> I started a profile of the reduce-task. I've attached the profiling
>>>>>>> output. It seems from the samples that ramManager.waitForDataToMerge()
>>>>>>> doesn't actually wait.
>>>>>>> Has anybody seen this behavior?
>>>>>> 
>>>>>> This has been fixed in HADOOP-3940
>>>>>> 
>>>>>> On 9/4/08 6:36 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
>>>>>>> I have the same problem on our cluster.
>>>>>>> 
>>>>>>> It seems the reducer-tasks are using all CPU, long before there's
>>>>>>> anything to shuffle.
>>>>>>> 
>>>>>>> I started a profile of the reduce-task. I've attached the profiling
>>>>>>> output. It seems from the samples that ramManager.waitForDataToMerge()
>>>>>>> doesn't actually wait.
>>>>>>> Has anybody seen this behavior?
>>>>>>> 
>>>>>>> Espen
>>>>>>> 
>>>>>>> On Thursday 28 August 2008 06:11:42 wangxu wrote:
>>>>>>>> Hi, all
>>>>>>>> I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
>>>>>>>> and running hadoop on one namenode and 4 slaves.
>>>>>>>> Attached is my hadoop-site.xml, and I didn't change the file
>>>>>>>> hadoop-default.xml.
>>>>>>>> 
>>>>>>>> When the data in the segments is large, this kind of error occurs:
>>>>>>>> 
>>>>>>>> java.io.IOException: Could not obtain block: blk_-2634319951074439134_1129
>>>>>>>> file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
>>>>>>>>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
>>>>>>>>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
>>>>>>>>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
>>>>>>>>     at java.io.DataInputStream.readFully(DataInputStream.java:178)
>>>>>>>>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
>>>>>>>>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
>>>>>>>>     at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1646)
>>>>>>>>     at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1712)
>>>>>>>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1787)
>>>>>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:104)
>>>>>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
>>>>>>>>     at org.apache.hadoop.mapred.join.WrappedRecordReader.next(WrappedRecordReader.java:112)
>>>>>>>>     at org.apache.hadoop.mapred.join.WrappedRecordReader.accept(WrappedRecordReader.java:130)
>>>>>>>>     at org.apache.hadoop.mapred.join.CompositeRecordReader.fillJoinCollector(CompositeRecordReader.java:398)
>>>>>>>>     at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:56)
>>>>>>>>     at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:33)
>>>>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:165)
>>>>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
>>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>>>>>>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> How can I correct this?
>>>>>>>> Thanks.
>>>>>>>> Xu
>> 
>> 

