Hi there,

Apologies if this comes through twice, but I sent this mail a few hours
ago and haven't seen it appear on the mailing list.

I'm experiencing some unusual behaviour on our Hadoop 0.20.2 cluster.
Sporadically, we're getting "Call to namenode" failures on the
tasktrackers, causing tasks to fail:

2011-05-12 14:36:37,462 WARN org.apache.hadoop.mapred.TaskRunner:
attempt_201105090819_059_m_0038_0Child Error
java.io.IOException: Call to namenode/10.10.10.10:9000 failed on local
exception: java.io.EOFException
       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
       at org.apache.hadoop.ipc.Client.call(Client.java:743)
       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
       at $Proxy5.getFileInfo(Unknown Source)
       at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
       at java.lang.reflect.Method.invoke(Unknown Source)
       at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
       at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
       at $Proxy5.getFileInfo(Unknown Source)
       at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
       at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:210)
Caused by: java.io.EOFException
       at java.io.DataInputStream.readInt(Unknown Source)
       at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

The namenode log (logging level = INFO) shows the following within a few
seconds either side of the above timestamp. It could be relevant, or it
could be a coincidence:

2011-05-12 14:36:40,005 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 57 on 9000 caught: java.nio.channels.ClosedChannelException
       at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
       at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
       at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1213)
       at org.apache.hadoop.ipc.Server.access$1900(Server.java:77)
       at
org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:622)
       at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:686)
       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:997)

The jobtracker does, however, have an entry that correlates with the
tasktracker error:

2011-05-12 14:36:39,781 INFO org.apache.hadoop.mapred.TaskInProgress: Error
from attempt_201105090819_059_m_0038_0: java.io.IOException: Call to
namenode/10.10.10.10:9000 failed on local exception: java.io.EOFException
       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
       at org.apache.hadoop.ipc.Client.call(Client.java:743)
       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
       at $Proxy1.getProtocolVersion(Unknown Source)
       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
       at
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:105)
       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:208)
       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:169)
       at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
       at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
       at org.apache.hadoop.mapred.Child.main(Child.java:157)
Caused by: java.io.EOFException
       at java.io.DataInputStream.readInt(Unknown Source)
       at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

Can anyone give me any pointers on how to start troubleshooting this?
The issue is very sporadic and we haven't been able to reproduce it in our
lab yet. After looking through the mailing list archives, most of the
suggestions revolve around raising the following settings (a sketch of the
corresponding hdfs-site.xml change is below):

dfs.namenode.handler.count    128  (existing 64)
dfs.datanode.handler.count     10  (existing 3)
dfs.datanode.max.xcievers    4096  (existing 256)
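
For what it's worth, this is roughly how I'd expect to apply those
suggestions in hdfs-site.xml on the namenode and datanodes (just a sketch
of my understanding, using the values above; I'm assuming the daemons
would need a restart to pick up the new values):

       <!-- hdfs-site.xml (sketch): raise RPC handler and transfer thread limits -->
       <property>
         <name>dfs.namenode.handler.count</name>  <!-- namenode RPC handler threads: 64 -> 128 -->
         <value>128</value>
       </property>
       <property>
         <name>dfs.datanode.handler.count</name>  <!-- datanode RPC handler threads: 3 -> 10 -->
         <value>10</value>
       </property>
       <property>
         <name>dfs.datanode.max.xcievers</name>   <!-- max concurrent datanode transfer threads: 256 -> 4096 -->
         <value>4096</value>
       </property>

That said, I'd rather understand whether these settings are actually
related to the failures before changing them in production.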

Any pointers?

Thanks in advance

Sid Simmons
Infrastructure Support Specialist
