Thanks Koji, Raghu. This seems to have solved our problem; I haven't seen it happen in the past 2 days. What is the typical value of ipc.client.idlethreshold on big clusters? Does the default value of 4000 suffice?

Lohit
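(As a minimal sketch of how the two settings discussed below could be raised, assuming the standard org.apache.hadoop.conf.Configuration API; the property names come from the thread, but the 8000 value and the class name are purely illustrative assumptions, not recommendations from this list.)

    import org.apache.hadoop.conf.Configuration;

    public class IdleThresholdOverride {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Illustrative override: only start inspecting connections for
            // idleness once more than 8000 are open (the shipped default is 4000).
            conf.setInt("ipc.client.idlethreshold", 8000);
            // Maximum idle time in milliseconds before a connection may be
            // dropped; 120000 (2 minutes) is the shipped default.
            conf.setInt("ipc.client.maxidletime", 120000);
            System.out.println("idlethreshold = "
                    + conf.getInt("ipc.client.idlethreshold", 4000));
        }
    }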
----- Original Message ----
From: Koji Noguchi <[email protected]>
To: [email protected]
Sent: Monday, March 30, 2009 9:30:04 AM
Subject: RE: Socket closed Exception

Lohit,

You're right. We saw "java.net.SocketTimeoutException: timed out waiting for rpc response" and not the Socket closed exception. If you're getting the "closed exception", then I don't remember seeing that problem on our clusters.

Our users often report the "Socket closed exception" as a problem, but in most cases those failures are due to jobs failing for completely different reasons and a race condition between 1) the JobTracker removing the directory/killing tasks and 2) the tasks failing with the closed exception before they get killed.

Koji

-----Original Message-----
From: lohit [mailto:[email protected]]
Sent: Monday, March 30, 2009 8:51 AM
To: [email protected]
Subject: Re: Socket closed Exception

Thanks Koji. If I look at the code, the NameNode (RPC Server) seems to tear down idle connections. Did you see a 'Socket closed' exception instead of 'timed out waiting for socket'? We seem to hit the 'Socket closed' exception where clients do not time out, but get back a socket closed exception when they do an RPC for create/open/getFileInfo. I will give this a try.

Thanks again,
Lohit

----- Original Message ----
From: Koji Noguchi <[email protected]>
To: [email protected]
Sent: Sunday, March 29, 2009 11:44:29 PM
Subject: RE: Socket closed Exception

Hi Lohit,

My initial guess would be https://issues.apache.org/jira/browse/HADOOP-4040

When this happened on our 0.17 cluster, all of our (task) clients were using the max idle time of 1 hour due to this bug instead of the configured value of a few seconds. Thus each client kept the connection up much longer than we expected. (Not sure if this applies to your 0.15 cluster, but it sounds similar to what we observed.)

This worked until the namenode started hitting the limit set by 'ipc.client.idlethreshold'.

  <name>ipc.client.idlethreshold</name>
  <value>4000</value>
  <description>Defines the threshold number of connections after which
               connections will be inspected for idleness.
  </description>

When inspecting for idleness, the namenode uses

  <name>ipc.client.maxidletime</name>
  <value>120000</value>
  <description>Defines the maximum idle time for a connected client after
               which it may be disconnected.
  </description>

As a result, many connections got disconnected at once. Clients only see the timeouts when they try to re-use those sockets the next time and wait for 1 minute. That's why the failures are not at exactly the same time, but *almost* the same time.

# If this solves your problem, Raghu should get the credit. He spent so many hours solving this mystery for us. :)

Koji

-----Original Message-----
From: lohit [mailto:[email protected]]
Sent: Sunday, March 29, 2009 11:56 AM
To: [email protected]
Subject: Socket closed Exception

Recently we have been seeing a lot of Socket closed exceptions in our cluster. Many tasks' open/create/getFileInfo calls get back a 'SocketException' with the message 'Socket closed'. We see many tasks fail with the same error at around the same time. There are no warning or info messages in the NameNode/TaskTracker/Task logs. (This is on HDFS 0.15.)

Are there cases where the NameNode closes sockets due to heavy load or contention for resources of any kind?

Thanks,
Lohit
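(A rough sketch of the disconnect behavior Koji describes above, assuming a plain Java model of the server's connection list; this is not the actual Hadoop IPC Server code, and the class and field names are made up for illustration.)

    import java.util.Iterator;
    import java.util.List;

    class IdleScanSketch {
        static final int IDLE_THRESHOLD = 4000;       // ipc.client.idlethreshold
        static final long MAX_IDLE_TIME_MS = 120000L; // ipc.client.maxidletime

        static class Conn {
            long lastActivityMs;          // last time an RPC used this connection
            void close() { /* tear down the underlying socket */ }
        }

        // Below the threshold the server never inspects connections for idleness;
        // once the count crosses it, every connection idle longer than maxidletime
        // is closed, so many clients can be cut off at once and then see
        // "Socket closed" on their next create/open/getFileInfo call.
        static void cleanupIdle(List<Conn> connections, long nowMs) {
            if (connections.size() <= IDLE_THRESHOLD) {
                return;
            }
            for (Iterator<Conn> it = connections.iterator(); it.hasNext(); ) {
                Conn c = it.next();
                if (nowMs - c.lastActivityMs > MAX_IDLE_TIME_MS) {
                    c.close();
                    it.remove();
                }
            }
        }
    }

The point of the sketch is that nothing is inspected until the threshold is crossed, which is why so many long-idle clients can be disconnected, and then fail, at almost the same time.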
