Re: DataXceiver error

Raghu Angadi Thu, 24 Sep 2009 18:36:33 -0700

This exception is not related to max.xceivers, though the two are correlated: users who need a lot of xceivers tend to have slow readers (nothing wrong with that). It has absolutely no relation to the handler count.

Is the exception actually resulting in task/job failures? If yes, with 0.19 your only option is to set the timeout to 0, as Amandeep suggested.

In 0.20, clients recover correctly from such errors. The failures caused by this exception should go away.


Amandeep, you should not need to set it to 0 if you are on 0.20-based HBase.

Raghu.

Florian Leibert wrote:

We can't really alter the jobs... This is a rather complex system with our
own DSL for writing jobs so that other departments can use our data. The
number of mappers is determined based on the number of input files
involved...

Setting this to 0 in a cluster where resources will be scarce at times
doesn't really sound like a solution - I don't have any of these problems on
our 30 node test cluster, so I can't really try it out there and setting the
timeout to 0 on production doesn't give me a great deal of confidence...


On Thu, Sep 24, 2009 at 3:48 PM, Amandeep Khurana <[email protected]> wrote:

On Thu, Sep 24, 2009 at 3:39 PM, Florian Leibert <[email protected]> wrote:

This happens maybe 4-5 times a day on an arbitrary node - it usually occurs
during very intense jobs where there are 10s of thousands of map tasks
scheduled...

Right. So the most probable reason is that the particular file being read
is being kept open during the computation, and that's causing the timeouts.
You can try to alter your jobs and the number of tasks and see if you can
come up with a workaround.

From what I gather in the code, this results from a write attempt - the
selector seems to wait until it can write to a channel - setting this to

might impact our cluster reliability, hence I'm not

Setting the timeout to 0 doesn't impact the cluster reliability. We have it
set to 0 on our clusters as well, and it's a pretty normal thing to do.
However, we do it because we are using HBase as well, and that is known to
keep file handles open for long periods. But setting the timeout to 0
doesn't impact any of our non-HBase applications/jobs at all. So it's not a
problem.

On Thu, Sep 24, 2009 at 3:16 PM, Amandeep Khurana <[email protected]>
wrote:

What were you doing when you got this error? Did you monitor the resource
consumption during whatever you were doing?

The reason I suggested it is that sometimes file handles are open for longer
than the timeout for some reason (intended, though), and that causes trouble.
So people keep the timeout at 0 to solve this problem.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert <[email protected]> wrote:

I don't think setting the timeout to 0 is a good idea - after all we have
a lot of writes going on, so it should happen at times that a resource isn't
available immediately. Am I missing something, or what's your reasoning for
assuming that the timeout value is the problem?

On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana <[email protected]>
wrote:

When do you get this error?

Try setting the timeout to 0. That'll remove the timeout of 480s.

Property name: dfs.datanode.socket.write.timeout
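
For reference, a minimal sketch of how that property might be written in the site configuration. It assumes the file is hadoop-site.xml on a 0.19 datanode (hdfs-site.xml on 0.20); the value is in milliseconds, and 0 disables the write timeout instead of the 480000 ms seen in the exception:

```xml
<!-- Sketch only: the target file name is an assumption that depends on the Hadoop version -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- 0 = no write timeout -->
  <value>0</value>
</property>
```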

-ak



Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert <[email protected]> wrote:

Hi,
recently, we're seeing frequent SocketTimeoutExceptions (STEs) in our
datanodes. We had previously fixed this issue by upping the handler count
and max.xciever (note this is misspelled in the code as well - so we're just
being consistent).
We're using 0.19 with a couple of patches - none of which should affect any
of the areas in the stacktrace.

We've seen this before upping the limits on the xcievers - but these
settings seem very high already. We're running 102 nodes.

Any hints would be appreciated.

<property>
  <name>dfs.datanode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>300</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2000</value>
</property>


2009-09-24 17:48:13,648 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.16.160.79:50010, storageID=DS-1662533511-10.16.160.79-50010-1219665628349, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010 remote=/10.16.134.78:34280]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
        at java.lang.Thread.run(Thread.java:619)
