You are running out of file handles on the namenode. When this
happens, the namenode cannot receive heartbeats from datanodes,
because those heartbeats arrive over TCP/IP socket connections and
the namenode has no free file descriptors left to accept them. Your
data is still safe on the datanodes. If you increase the number of
handles on the namenode, all datanodes will re-join the cluster and
things should be fine.
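
If you want to confirm that descriptor exhaustion is the problem, one
option is to watch the namenode JVM's file descriptor usage directly.
Below is a minimal sketch (the FdUsage class name is just for
illustration), assuming a Sun JDK that exposes the
com.sun.management.UnixOperatingSystemMXBean interface (JDK 6 does):

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    import com.sun.management.UnixOperatingSystemMXBean;

    // Sketch: print open vs. maximum file descriptors for this JVM.
    // The same JMX polling could be pointed at the namenode process
    // to see how close it is to the per-process limit.
    public class FdUsage {
        public static void main(String[] args) {
            OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof UnixOperatingSystemMXBean) {
                UnixOperatingSystemMXBean unix =
                    (UnixOperatingSystemMXBean) os;
                System.out.println("open fds: "
                        + unix.getOpenFileDescriptorCount()
                        + " / max fds: "
                        + unix.getMaxFileDescriptorCount());
            } else {
                System.out.println(
                    "fd counts not available on this platform");
            }
        }
    }

On Linux you would typically raise the limit for the user running the
namenode with 'ulimit -n' (or in /etc/security/limits.conf) and then
restart the namenode.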

What OS platform is the namenode running on?

thanks,
dhruba

On Sun, Jun 15, 2008 at 5:47 AM, Murali Krishna <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I was running an M/R job on a 90+ node cluster. While the job was
> running, all the data nodes seem to have become dead. The only major
> error I saw in the name node log is 'java.io.IOException: Too many
> open files'. The job might try to open thousands of files.
>
> After some time, there were a lot of exceptions saying 'could only
> be replicated to 0 nodes instead of 1'. So it looks like none of the
> data nodes are responding now; the job has failed since it couldn't
> write. I can see the following in the data node logs:
>
> 2008-06-15 02:38:28,477 WARN org.apache.hadoop.dfs.DataNode:
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:484)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
>         at org.apache.hadoop.dfs.$Proxy0.sendHeartbeat(Unknown Source)
>
> All processes (datanodes + namenode) are still running, but the dfs
> health status page shows all nodes as dead.
>
> Some questions:
>
> * Is this kind of behavior expected when the name node runs out of
>   file handles?
> * Why are the data nodes not able to send the heartbeat? (Is it
>   related to the name node not having enough handles?)
> * What happens to the data in HDFS when all the data nodes fail to
>   send the heartbeat and the name node is in this state?
> * Is the solution just to increase the number of file handles and
>   restart the cluster?
>
> Thanks,
>
> Murali
>
