On Thu, Sep 24, 2009 at 6:28 PM, Raghu Angadi <[email protected]> wrote:
> This exception is not related to max.xceivers, though they are correlated.
> Users who need a lot of xceivers tend to have slow readers (nothing wrong
> with that). And it has absolutely no relation to the handler count.
>
> Is the exception actually resulting in task/job failures? If yes, with
> 0.19, your only option is to set the timeout to 0 as Amandeep suggested.
>
> In 0.20 clients recover correctly from such errors. The failures because
> of this exception should go away.
>
> Amandeep, you should need to set it to 0 if you are on 0.20-based HBase.

I should/shouldn't? I'm on 0.20 and have it set to 0... It just avoids the
exception altogether and doesn't hurt performance in any way (I think so)...
Correct me if I'm wrong on this.

> Raghu.
>
> Florian Leibert wrote:
>
>> We can't really alter the jobs... This is a rather complex system with our
>> own DSL for writing jobs so that other departments can use our data. The
>> number of mappers is determined based on the number of input files
>> involved...
>>
>> Setting this to 0 in a cluster where resources will be scarce at times
>> doesn't really sound like a solution - I don't have any of these problems
>> on our 30 node test cluster, so I can't really try it out there, and
>> setting the timeout to 0 on production doesn't give me a great deal of
>> confidence...
>>
>> On Thu, Sep 24, 2009 at 3:48 PM, Amandeep Khurana <[email protected]>
>> wrote:
>>
>>> On Thu, Sep 24, 2009 at 3:39 PM, Florian Leibert <[email protected]> wrote:
>>>
>>>> This happens maybe 4-5 times a day on an arbitrary node - it usually
>>>> occurs during very intense jobs where there are 10s of thousands of map
>>>> tasks scheduled...
>>>
>>> Right.. So the reason most probably is that the particular file being
>>> read is being kept open during the computation and that's causing the
>>> timeouts. You can try to alter your jobs and number of tasks and see if
>>> you can come up with a workaround.
>>>
>>>> From what I gather in the code, this results from a write attempt - the
>>>> selector seems to wait until it can write to a channel - setting this to
>>>> 0 might impact our cluster reliability, hence I'm not
>>>
>>> Setting the timeout to 0 doesn't impact the cluster reliability. We have
>>> it set to 0 on our clusters as well and it's a pretty normal thing to do.
>>> However, we do it because we are using HBase as well and that is known to
>>> keep file handles open for long periods. But setting the timeout to 0
>>> doesn't impact any of our non-HBase applications/jobs at all.. So it's
>>> not a problem.
>>>
>>>> On Thu, Sep 24, 2009 at 3:16 PM, Amandeep Khurana <[email protected]>
>>>> wrote:
>>>>
>>>>> What were you doing when you got this error? Did you monitor the
>>>>> resource consumption during whatever you were doing?
>>>>>
>>>>> The reason I said that is that sometimes file handles are open for
>>>>> longer than the timeout for some reason (intended though) and that
>>>>> causes trouble.. So people keep the timeout at 0 to solve this problem.
>>>>>
>>>>> Amandeep Khurana
>>>>> Computer Science Graduate Student
>>>>> University of California, Santa Cruz
>>>>>
>>>>> On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I don't think setting the timeout to 0 is a good idea - after all we
>>>>>> have a lot of writes going on, so it should happen at times that a
>>>>>> resource isn't available immediately. Am I missing something, or
>>>>>> what's your reasoning for assuming that the timeout value is the
>>>>>> problem?
>>>>>>
>>>>>> On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> When do you get this error?
>>>>>>>
>>>>>>> Try setting the timeout to 0. That'll remove the timeout of 480s.
>>>>>>> Property name: dfs.datanode.socket.write.timeout
>>>>>>>
>>>>>>> -ak
>>>>>>>
>>>>>>> Amandeep Khurana
>>>>>>> Computer Science Graduate Student
>>>>>>> University of California, Santa Cruz
>>>>>>>
>>>>>>> On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> recently we're seeing frequent STEs in our datanodes. We had
>>>>>>>> previously fixed this issue by upping the handler count and
>>>>>>>> max.xciever (note this is misspelled in the code as well - so we're
>>>>>>>> just being consistent).
>>>>>>>> We're using 0.19 with a couple of patches - none of which should
>>>>>>>> affect any of the areas in the stack trace.
>>>>>>>>
>>>>>>>> We've seen this before upping the limits on the xcievers - but these
>>>>>>>> settings seem very high already. We're running 102 nodes.
>>>>>>>>
>>>>>>>> Any hints would be appreciated.
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>dfs.datanode.handler.count</name>
>>>>>>>>   <value>300</value>
>>>>>>>> </property>
>>>>>>>> <property>
>>>>>>>>   <name>dfs.namenode.handler.count</name>
>>>>>>>>   <value>300</value>
>>>>>>>> </property>
>>>>>>>> <property>
>>>>>>>>   <name>dfs.datanode.max.xcievers</name>
>>>>>>>>   <value>2000</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> 2009-09-24 17:48:13,648 ERROR
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>>>>>> DatanodeRegistration(10.16.160.79:50010,
>>>>>>>> storageID=DS-1662533511-10.16.160.79-50010-1219665628349,
>>>>>>>> infoPort=50075, ipcPort=50020):DataXceiver
>>>>>>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
>>>>>>>> for channel to be ready for write. ch :
>>>>>>>> java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010
>>>>>>>> remote=/10.16.134.78:34280]
>>>>>>>>   at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>>>>>>>>   at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>>>>>>   at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>>>>>>   at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>>>>>>>>   at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>>>>>>>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>>>>>>>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>>>>>>>>   at java.lang.Thread.run(Thread.java:619)
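
For anyone landing on this thread later: the workaround being debated above is a single
property change. A minimal sketch, assuming it goes in the cluster's hdfs-site.xml on 0.20
(hadoop-site.xml on 0.19) and that the datanodes are restarted afterwards - a value of 0
disables the default 480000 ms (480s) write timeout shown in the stack trace:

<!-- Hedged example: disable the datanode socket write timeout discussed in this
     thread. 0 means the write selector never times out; the default is 480000 ms.
     Takes effect after a datanode restart. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

The trade-off raised earlier in the thread still applies: with the timeout disabled, a
DataXceiver thread blocked on a dead or stalled reader holds its xceiver slot until the
connection itself fails, instead of being reclaimed after 8 minutes.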
