On Thu, Sep 24, 2009 at 6:28 PM, Raghu Angadi <[email protected]> wrote:
> This exception is not related to max.xceivers, though they are correlated.
> Users who need a lot of xceivers tend to have slow readers (nothing wrong
> with that). And it has absolutely no relation to the handler count.
>
> Is the exception actually resulting in task/job failures? If yes, with
> 0.19, your only option is to set the timeout to 0 as Amandeep suggested.
>
> In 0.20 clients recover correctly from such errors. The failures because
> of this exception should go away.
>
> Amandeep, you should need to set it to 0 if you are on 0.20-based HBase.

I should/shouldn't? I'm on 0.20 and have it set to 0... It just avoids the
exception altogether and doesn't hurt performance in any way (I think so)...
Correct me if I'm wrong on this.

> Raghu.
>
> Florian Leibert wrote:
>
>> We can't really alter the jobs... This is a rather complex system with our
>> own DSL for writing jobs so that other departments can use our data. The
>> number of mappers is determined based on the number of input files
>> involved...
>>
>> Setting this to 0 in a cluster where resources will be scarce at times
>> doesn't really sound like a solution - I don't have any of these problems
>> on our 30 node test cluster, so I can't really try it out there, and
>> setting the timeout to 0 on production doesn't give me a great deal of
>> confidence...
>>
>> On Thu, Sep 24, 2009 at 3:48 PM, Amandeep Khurana <[email protected]>
>> wrote:
>>
>>> On Thu, Sep 24, 2009 at 3:39 PM, Florian Leibert <[email protected]> wrote:
>>>
>>>> This happens maybe 4-5 times a day on an arbitrary node - it usually
>>>> occurs during very intense jobs where there are 10s of thousands of map
>>>> tasks scheduled...
>>>
>>> Right.. So the reason most probably is that the particular file being
>>> read is being kept open during the computation and that's causing the
>>> timeouts. You can try to alter your jobs and number of tasks and see if
>>> you can come up with a workaround.
>>>
>>>> From what I gather in the code, this results from a write attempt - the
>>>> selector seems to wait until it can write to a channel - setting this to
>>>> 0 might impact our cluster reliability, hence I'm not
>>>
>>> Setting the timeout to 0 doesn't impact the cluster reliability. We have
>>> it set to 0 on our clusters as well and it's a pretty normal thing to do.
>>> However, we do it because we are using HBase as well and that is known to
>>> keep file handles open for long periods. But setting the timeout to 0
>>> doesn't impact any of our non-HBase applications/jobs at all.. So it's
>>> not a problem.
>>>
>>>> On Thu, Sep 24, 2009 at 3:16 PM, Amandeep Khurana <[email protected]>
>>>> wrote:
>>>>
>>>>> What were you doing when you got this error? Did you monitor the
>>>>> resource consumption during whatever you were doing?
>>>>>
>>>>> The reason I said that is that sometimes file handles are open for
>>>>> longer than the timeout for some reason (intended though) and that
>>>>> causes trouble.. So people keep the timeout at 0 to solve this problem.
>>>>>
>>>>> Amandeep Khurana
>>>>> Computer Science Graduate Student
>>>>> University of California, Santa Cruz
>>>>>
>>>>> On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I don't think setting the timeout to 0 is a good idea - after all we
>>>>>> have a lot of writes going on, so it should happen at times that a
>>>>>> resource isn't available immediately. Am I missing something, or
>>>>>> what's your reasoning for assuming that the timeout value is the
>>>>>> problem?
>>>>>>
>>>>>> On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> When do you get this error?
>>>>>>>
>>>>>>> Try setting the timeout to 0. That'll remove the timeout of 480s.
>>>>>>> Property name: dfs.datanode.socket.write.timeout
>>>>>>>
>>>>>>> -ak
>>>>>>>
>>>>>>> Amandeep Khurana
>>>>>>> Computer Science Graduate Student
>>>>>>> University of California, Santa Cruz
>>>>>>>
>>>>>>> On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> recently we're seeing frequent STEs in our datanodes. We had
>>>>>>>> previously fixed this issue by upping the handler count and
>>>>>>>> max.xciever (note this is misspelled in the code as well - so we're
>>>>>>>> just being consistent).
>>>>>>>> We're using 0.19 with a couple of patches - none of which should
>>>>>>>> affect any of the areas in the stack trace.
>>>>>>>>
>>>>>>>> We've seen this before upping the limits on the xcievers - but these
>>>>>>>> settings seem very high already. We're running 102 nodes.
>>>>>>>>
>>>>>>>> Any hints would be appreciated.
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>dfs.datanode.handler.count</name>
>>>>>>>>   <value>300</value>
>>>>>>>> </property>
>>>>>>>> <property>
>>>>>>>>   <name>dfs.namenode.handler.count</name>
>>>>>>>>   <value>300</value>
>>>>>>>> </property>
>>>>>>>> <property>
>>>>>>>>   <name>dfs.datanode.max.xcievers</name>
>>>>>>>>   <value>2000</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> 2009-09-24 17:48:13,648 ERROR
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>>>>>> DatanodeRegistration(10.16.160.79:50010,
>>>>>>>> storageID=DS-1662533511-10.16.160.79-50010-1219665628349,
>>>>>>>> infoPort=50075, ipcPort=50020):DataXceiver
>>>>>>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
>>>>>>>> for channel to be ready for write. ch :
>>>>>>>> java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010
>>>>>>>> remote=/10.16.134.78:34280]
>>>>>>>>   at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>>>>>>>>   at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>>>>>>   at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>>>>>>   at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>>>>>>>>   at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>>>>>>>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>>>>>>>>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>>>>>>>>   at java.lang.Thread.run(Thread.java:619)
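
For anyone landing on this thread later: the workaround being debated above is a single
property change. A minimal sketch, assuming it goes in the cluster's hdfs-site.xml on 0.20
(hadoop-site.xml on 0.19) and that the datanodes are restarted afterwards - a value of 0
disables the default 480000 ms (480s) write timeout shown in the stack trace:

<!-- Hedged example: disable the datanode socket write timeout discussed in this
     thread. 0 means the write selector never times out; the default is 480000 ms.
     Takes effect after a datanode restart. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

The trade-off raised earlier in the thread still applies: with the timeout disabled, a
DataXceiver thread blocked on a dead or stalled reader holds its xceiver slot until the
connection itself fails, instead of being reclaimed after 8 minutes.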
