This happens maybe 4-5 times a day on an arbitrary node - it usually occurs during very intense jobs where there are tens of thousands of map tasks scheduled... From what I gather from the code, this results from a write attempt - the selector waits until it can write to the channel - so setting the timeout to 0 might impact our cluster reliability, hence I'm not inclined to do that.
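For anyone trying to follow the code path in the trace below: here is a rough, self-contained sketch (not the actual Hadoop source - class and method names are made up for illustration) of the selector-based wait that SocketIOWithTimeout appears to perform. The channel is registered for OP_WRITE and the write gives up if the channel never becomes writable within the limit; the 480000 ms figure matches the timeout in the stack trace.

// Rough sketch only, not the Hadoop implementation.
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class TimedChannelWriter {
    private static final long WRITE_TIMEOUT_MS = 480000L; // 8 minutes, as in the log

    public static void writeFully(SocketChannel channel, ByteBuffer buf) throws IOException {
        channel.configureBlocking(false);
        Selector selector = Selector.open();
        try {
            SelectionKey key = channel.register(selector, SelectionKey.OP_WRITE);
            while (buf.hasRemaining()) {
                // Wait until the channel is ready for write or the timeout expires.
                // (Simplified: a spurious wakeup would restart the full timeout.)
                int ready = selector.select(WRITE_TIMEOUT_MS);
                if (ready == 0) {
                    throw new SocketTimeoutException(WRITE_TIMEOUT_MS
                            + " millis timeout while waiting for channel to be ready for write");
                }
                selector.selectedKeys().clear();
                channel.write(buf); // may write fewer bytes than remaining
            }
            key.cancel();
        } finally {
            selector.close();
        }
    }
}

If every DataXceiver thread is stuck in a wait like this because the remote reader isn't draining its socket, the threads pile up even with a high xciever limit - which is why I'd rather understand the slow readers than disable the timeout.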
On Thu, Sep 24, 2009 at 3:16 PM, Amandeep Khurana <[email protected]> wrote:

> What were you doing when you got this error? Did you monitor the resource
> consumption during whatever you were doing?
>
> The reason I asked is that sometimes file handles stay open longer than the
> timeout (intentionally) and that causes trouble, so people set the timeout
> to 0 to solve this problem.
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
> On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert <[email protected]> wrote:
>
> > I don't think setting the timeout to 0 is a good idea - after all, we have
> > a lot of writes going on, so it should happen at times that a resource
> > isn't available immediately. Am I missing something, or what's your
> > reasoning for assuming that the timeout value is the problem?
> >
> > On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana <[email protected]> wrote:
> >
> > > When do you get this error?
> > >
> > > Try setting the timeout to 0. That'll remove the timeout of 480s.
> > > Property name: dfs.datanode.socket.write.timeout
> > >
> > > -ak
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > > On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert <[email protected]> wrote:
> > >
> > > > Hi,
> > > > recently we're seeing frequent SocketTimeoutExceptions (STEs) in our
> > > > datanodes. We had previously fixed this issue by upping the handler
> > > > counts and max.xcievers (note this is misspelled in the code as well -
> > > > so we're just being consistent).
> > > > We're using 0.19 with a couple of patches - none of which should affect
> > > > any of the areas in the stack trace.
> > > >
> > > > We've seen this before upping the limits on the xcievers - but these
> > > > settings seem very high already. We're running 102 nodes.
> > > >
> > > > Any hints would be appreciated.
> > > >
> > > > <property>
> > > >   <name>dfs.datanode.handler.count</name>
> > > >   <value>300</value>
> > > > </property>
> > > > <property>
> > > >   <name>dfs.namenode.handler.count</name>
> > > >   <value>300</value>
> > > > </property>
> > > > <property>
> > > >   <name>dfs.datanode.max.xcievers</name>
> > > >   <value>2000</value>
> > > > </property>
> > > >
> > > > 2009-09-24 17:48:13,648 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > DatanodeRegistration(10.16.160.79:50010,
> > > > storageID=DS-1662533511-10.16.160.79-50010-1219665628349, infoPort=50075,
> > > > ipcPort=50020):DataXceiver
> > > > java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> > > > channel to be ready for write. ch :
> > > > java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010
> > > > remote=/10.16.134.78:34280]
> > > >     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > > >     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > > >     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > > >     at java.lang.Thread.run(Thread.java:619)
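For reference, the timeout in the trace is the one controlled by dfs.datanode.socket.write.timeout, which defaults to 480000 ms (the 480s shown above). If raising it rather than disabling it outright is preferable, a snippet along the lines of the properties quoted above should work - the value here is purely an illustrative placeholder, not a recommendation:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- milliseconds; 0 disables the write timeout entirely. The default is
       480000 (8 minutes); 960000 is only a placeholder for illustration. -->
  <value>960000</value>
</property>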
