Thanks Koji, Raghu. This seems to have solved our problem; I haven't seen it happen in the past 2 days. What is the typical value of ipc.client.idlethreshold on big clusters? Does the default value of 4000 suffice?

Lohit
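(As a minimal sketch of how the two settings discussed below could be raised, assuming the standard org.apache.hadoop.conf.Configuration API; the property names come from the thread, but the 8000 value and the class name are purely illustrative assumptions, not recommendations from this list.)

    import org.apache.hadoop.conf.Configuration;

    public class IdleThresholdOverride {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Illustrative override: only start inspecting connections for
            // idleness once more than 8000 are open (the shipped default is 4000).
            conf.setInt("ipc.client.idlethreshold", 8000);
            // Maximum idle time in milliseconds before a connection may be
            // dropped; 120000 (2 minutes) is the shipped default.
            conf.setInt("ipc.client.maxidletime", 120000);
            System.out.println("idlethreshold = "
                    + conf.getInt("ipc.client.idlethreshold", 4000));
        }
    }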
----- Original Message ----
From: Koji Noguchi <[email protected]>
To: [email protected]
Sent: Monday, March 30, 2009 9:30:04 AM
Subject: RE: Socket closed Exception

Lohit,

You're right. We saw "java.net.SocketTimeoutException: timed out waiting for rpc response" and not the Socket closed exception. If you're getting the "closed exception", then I don't remember seeing that problem on our clusters.

Our users often report the "Socket closed exception" as a problem, but in most cases those failures are due to jobs failing for completely different reasons and a race condition between 1) the JobTracker removing the directory/killing tasks and 2) the tasks failing with the closed exception before they get killed.

Koji

-----Original Message-----
From: lohit [mailto:[email protected]]
Sent: Monday, March 30, 2009 8:51 AM
To: [email protected]
Subject: Re: Socket closed Exception

Thanks Koji. If I look at the code, the NameNode (RPC Server) seems to tear down idle connections. Did you see a 'Socket closed' exception instead of 'timed out waiting for socket'? We seem to hit the 'Socket closed' exception where clients do not time out, but get back a socket closed exception when they do an RPC for create/open/getFileInfo. I will give this a try.

Thanks again,
Lohit

----- Original Message ----
From: Koji Noguchi <[email protected]>
To: [email protected]
Sent: Sunday, March 29, 2009 11:44:29 PM
Subject: RE: Socket closed Exception

Hi Lohit,

My initial guess would be https://issues.apache.org/jira/browse/HADOOP-4040

When this happened on our 0.17 cluster, all of our (task) clients were using the max idle time of 1 hour due to this bug instead of the configured value of a few seconds. Thus each client kept the connection up much longer than we expected. (Not sure if this applies to your 0.15 cluster, but it sounds similar to what we observed.)

This worked until the namenode started hitting the limit set by 'ipc.client.idlethreshold'.

  <name>ipc.client.idlethreshold</name>
  <value>4000</value>
  <description>Defines the threshold number of connections after which
               connections will be inspected for idleness.
  </description>

When inspecting for idleness, the namenode uses

  <name>ipc.client.maxidletime</name>
  <value>120000</value>
  <description>Defines the maximum idle time for a connected client after
               which it may be disconnected.
  </description>

As a result, many connections got disconnected at once. Clients only see the timeouts when they try to re-use those sockets the next time and wait for 1 minute. That's why the failures are not at exactly the same time, but *almost* the same time.

# If this solves your problem, Raghu should get the credit. He spent so many hours solving this mystery for us. :)

Koji

-----Original Message-----
From: lohit [mailto:[email protected]]
Sent: Sunday, March 29, 2009 11:56 AM
To: [email protected]
Subject: Socket closed Exception

Recently we have been seeing a lot of Socket closed exceptions in our cluster. Many tasks' open/create/getFileInfo calls get back a 'SocketException' with the message 'Socket closed'. We see many tasks fail with the same error at around the same time. There are no warning or info messages in the NameNode/TaskTracker/Task logs. (This is on HDFS 0.15.)

Are there cases where the NameNode closes sockets due to heavy load or contention for resources of any kind?

Thanks,
Lohit
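(A rough sketch of the disconnect behavior Koji describes above, assuming a plain Java model of the server's connection list; this is not the actual Hadoop IPC Server code, and the class and field names are made up for illustration.)

    import java.util.Iterator;
    import java.util.List;

    class IdleScanSketch {
        static final int IDLE_THRESHOLD = 4000;       // ipc.client.idlethreshold
        static final long MAX_IDLE_TIME_MS = 120000L; // ipc.client.maxidletime

        static class Conn {
            long lastActivityMs;          // last time an RPC used this connection
            void close() { /* tear down the underlying socket */ }
        }

        // Below the threshold the server never inspects connections for idleness;
        // once the count crosses it, every connection idle longer than maxidletime
        // is closed, so many clients can be cut off at once and then see
        // "Socket closed" on their next create/open/getFileInfo call.
        static void cleanupIdle(List<Conn> connections, long nowMs) {
            if (connections.size() <= IDLE_THRESHOLD) {
                return;
            }
            for (Iterator<Conn> it = connections.iterator(); it.hasNext(); ) {
                Conn c = it.next();
                if (nowMs - c.lastActivityMs > MAX_IDLE_TIME_MS) {
                    c.close();
                    it.remove();
                }
            }
        }
    }

The point of the sketch is that nothing is inspected until the threshold is crossed, which is why so many long-idle clients can be disconnected, and then fail, at almost the same time.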
