I'll add a comment to Jira. I haven't tried the latest version of the patch
yet, but since it only changes the DFS client, not the datanode, I don't
see how it would help with this.

Two more things I noticed when the datanodes become unresponsive
(i.e. the "Last Contact" field on the namenode keeps increasing):

1. The datanode process seems to be completely hung for a while, including
its Jetty web interface, sometimes for over 10 minutes.

2. The task tracker on the same machine keeps humming along, sending regular
heartbeats.

To me this looks like some sort of temporary deadlock in the datanode that
keeps it from responding to requests. Perhaps it's the block report being
generated?
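One way to check that hypothesis would be to take a couple of thread dumps of the datanode JVM while it is hung. A rough sketch (jps/jstack ship with the JDK; the file names and the BLOCKED grep are just my assumptions about what to look for):

```shell
# Find the DataNode JVM and dump its threads twice, 30 seconds apart.
# If the same threads show up as BLOCKED in both dumps (e.g. inside
# block-report or dataset-scanning code), that points at where it is stuck.
DN_PID=$(jps | awk '/DataNode/ {print $1}')
jstack "$DN_PID" > dn-threads-1.txt
sleep 30
jstack "$DN_PID" > dn-threads-2.txt
grep -n 'BLOCKED' dn-threads-1.txt dn-threads-2.txt
```

(If jstack hangs too, `kill -QUIT $DN_PID` makes the JVM write the dump to its stdout log instead.)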

-- Stefan

> From: Raghu Angadi <[EMAIL PROTECTED]>
> Reply-To: <[email protected]>
> Date: Tue, 09 Sep 2008 16:40:02 -0700
> To: <[email protected]>
> Subject: Re: Could not obtain block: blk_-2634319951074439134_1129
> file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
> 
> Espen Amble Kolstad wrote:
>> There's a JIRA on this already:
>> https://issues.apache.org/jira/browse/HADOOP-3831
>> Setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml seems
>> to do the trick for now.
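>> 
>> For reference, that workaround would be a property like this in
>> hadoop-site.xml (a sketch; a value of 0 disables the datanode's
>> socket write timeout):
>> 
>> ```xml
>> <property>
>>   <name>dfs.datanode.socket.write.timeout</name>
>>   <value>0</value>
>> </property>
>> ```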
> 
> Please comment on HADOOP-3831 that you are seeing this error, so that
> it gets committed. Did you try the patch for HADOOP-3831?
> 
> thanks,
> Raghu.
> 
>> Espen
>> 
>> On Mon, Sep 8, 2008 at 11:24 AM, Espen Amble Kolstad <[EMAIL PROTECTED]> 
>> wrote:
>>> Hi,
>>> 
>>> Thanks for the tip!
>>> 
>>> I tried revision 692572 of the 0.18 branch, but I still get the same errors.
>>> 
>>> On Sunday 07 September 2008 09:42:43 Dhruba Borthakur wrote:
>>>> The DFS errors might have been caused by
>>>> 
>>>> http://issues.apache.org/jira/browse/HADOOP-4040
>>>> 
>>>> thanks,
>>>> dhruba
>>>> 
>>>> On Sat, Sep 6, 2008 at 6:59 AM, Devaraj Das <[EMAIL PROTECTED]> wrote:
>>>>> These exceptions are apparently coming from the dfs side of things. Could
>>>>> someone from the dfs side please look at these?
>>>>> 
>>>>> On 9/5/08 3:04 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Thanks!
>>>>>> The patch applies without change to hadoop-0.18.0, and should be
>>>>>> included in a 0.18.1.
>>>>>> 
>>>>>> However, I'm still seeing:
>>>>>> in hadoop.log:
>>>>>> 2008-09-05 11:13:54,805 WARN  dfs.DFSClient - Exception while reading
>>>>>> from blk_3428404120239503595_2664 of
>>>>>> /user/trank/segments/20080905102650/crawl_generate/part-00010 from
>>>>>> somehost:50010: java.io.IOException: Premeture EOF from inputStream
>>>>>> 
>>>>>> in datanode.log:
>>>>>> 2008-09-05 11:15:09,554 WARN  dfs.DataNode -
>>>>>> DatanodeRegistration(somehost:50010,
>>>>>> storageID=DS-751763840-somehost-50010-1219931304453, infoPort=50075,
>>>>>> ipcPort=50020):Got exception while serving
>>>>>> blk_-4682098638573619471_2662 to
>>>>>> /somehost:
>>>>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
>>>>>> for channel to be ready for write. ch :
>>>>>> java.nio.channels.SocketChannel[connected local=/somehost:50010
>>>>>> remote=/somehost:45244]
>>>>>> 
>>>>>> These entries in datanode.log happen a few minutes apart, repeatedly.
>>>>>> I've reduced the number of map tasks so the load on this node is below
>>>>>> 1.0, with 5GB of free memory (so it's not resource starvation).
>>>>>> 
>>>>>> Espen
>>>>>> 
>>>>>> On Thu, Sep 4, 2008 at 3:33 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
>>>>>>>> I started a profile of the reduce-task. I've attached the profiling
>>>>>>>> output. It seems from the samples that ramManager.waitForDataToMerge()
>>>>>>>> doesn't actually wait.
>>>>>>>> Has anybody seen this behavior?
>>>>>>> This has been fixed in HADOOP-3940
>>>>>>> 
>>>>>>> On 9/4/08 6:36 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
>>>>>>>> I have the same problem on our cluster.
>>>>>>>> 
>>>>>>>> It seems the reducer-tasks are using all cpu, long before there's
>>>>>>>> anything to
>>>>>>>> shuffle.
>>>>>>>> 
>>>>>>>> I started a profile of the reduce-task. I've attached the profiling
>>>>>>>> output. It seems from the samples that ramManager.waitForDataToMerge()
>>>>>>>> doesn't actually wait.
>>>>>>>> Has anybody seen this behavior?
>>>>>>>> 
>>>>>>>> Espen
>>>>>>>> 
>>>>>>>> On Thursday 28 August 2008 06:11:42 wangxu wrote:
>>>>>>>>> Hi, all
>>>>>>>>> I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
>>>>>>>>> and running hadoop on one namenode and 4 slaves.
>>>>>>>>> Attached is my hadoop-site.xml; I didn't change the file
>>>>>>>>> hadoop-default.xml.
>>>>>>>>> 
>>>>>>>>> when data in the segments is large, this kind of error occurs:
>>>>>>>>> 
>>>>>>>>> java.io.IOException: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
>>>>>>>>>   at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
>>>>>>>>>   at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
>>>>>>>>>   at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
>>>>>>>>>   at java.io.DataInputStream.readFully(DataInputStream.java:178)
>>>>>>>>>   at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
>>>>>>>>>   at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
>>>>>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1646)
>>>>>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1712)
>>>>>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1787)
>>>>>>>>>   at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:104)
>>>>>>>>>   at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
>>>>>>>>>   at org.apache.hadoop.mapred.join.WrappedRecordReader.next(WrappedRecordReader.java:112)
>>>>>>>>>   at org.apache.hadoop.mapred.join.WrappedRecordReader.accept(WrappedRecordReader.java:130)
>>>>>>>>>   at org.apache.hadoop.mapred.join.CompositeRecordReader.fillJoinCollector(CompositeRecordReader.java:398)
>>>>>>>>>   at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:56)
>>>>>>>>>   at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:33)
>>>>>>>>>   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:165)
>>>>>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
>>>>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>>>>>>>>>   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> how can I correct this?
>>>>>>>>> thanks.
>>>>>>>>> Xu
>>> 

