zhihai xu created HDFS-7801:
-------------------------------
Summary: "IOException:NameNode still not started" cause DFSClient
operation failure without retry.
Key: HDFS-7801
URL: https://issues.apache.org/jira/browse/HDFS-7801
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client, namenode
Reporter: zhihai xu
The "IOException: NameNode still not started" error causes a DFSClient operation
to fail without a retry.
In YARN-1778, TestFSRMStateStore failed intermittently because of
"java.io.IOException: NameNode still not started".
The stack trace for this exception is the following:
{code}
2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore
(TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not
started
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1405)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy23.mkdirs(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at com.sun.proxy.$Proxy24.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2991)
at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2961)
at
org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:973)
at
org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:969)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:969)
at
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:962)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1869)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:364)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore$2.run(TestFSRMStateStore.java:273)
2015-02-03 00:09:19,089 INFO [IPC Server handler 0 on 57792] ipc.Server
(Server.java:run(2155)) - IPC Server handler 0 on 57792, call
org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 127.0.0.1:57805
Call#14 Retry#1
java.io.IOException: NameNode still not started
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)
{code}
The cause of this intermittent failure is as follows.
The NameNode constructor [sets the started flag at the
end|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L826].
However, it starts the
[NameNodeRpcServer|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L685]
earlier, by calling the function initialize, before the started flag is set.
If the client (which tries to call mkdirs) connects to the NameNode server before
the started flag is set,
the call fails with java.io.IOException: "NameNode still not started", and the
test fails.
If the client connects to the NameNode server after the started flag is set, the
test succeeds.
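
The window described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyNameNode, startRpcServer, and markStarted are hypothetical names invented for illustration, not Hadoop classes or methods. The model keeps only the ordering that matters here: the RPC server accepts calls before the started flag is set.

```java
import java.io.IOException;

/** Toy stand-in for the NameNode; hypothetical, not Hadoop code. */
class ToyNameNode {
    private volatile boolean rpcServing = false; // RPC server accepts calls
    private volatile boolean started = false;    // set at the end of construction

    void startRpcServer() { rpcServing = true; }
    void markStarted() { started = true; }

    /** Mirrors the shape of a startup check followed by the real operation. */
    boolean mkdirs(String path) throws IOException {
        if (!rpcServing) {
            throw new IllegalStateException("RPC server is not listening");
        }
        if (!started) {
            throw new IOException("NameNode still not started");
        }
        return true;
    }
}

class StartupRaceDemo {
    public static void main(String[] args) throws IOException {
        ToyNameNode nn = new ToyNameNode();
        nn.startRpcServer();              // calls are accepted from here on
        try {
            nn.mkdirs("/tmp/early");      // arrives inside the window
        } catch (IOException e) {
            System.out.println("early call failed: " + e.getMessage());
        }
        nn.markStarted();                 // window closes
        System.out.println("late call succeeded: " + nn.mkdirs("/tmp/late"));
    }
}
```

In the real test the two calls come from a separate client thread, so which branch is taken depends on timing; the toy model makes the two outcomes deterministic by sequencing them explicitly.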
As discussed in YARN-1778, there are two ways to fix this issue in HDFS:
1. Reorder the code in the NameNode constructor: move rpcServer.start to the end,
after the started flag is set.
2. Retry in DFSClient on "IOException: NameNode still not started". We can
create a new RetryPolicy that retries on this exception.
We need to discuss which is the correct way to fix this issue, or whether
we need to fix it at all if we can guarantee that the DFSClient always starts
after the NameNode in real-world deployments.
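
To make option 2 concrete, here is a rough sketch of a client-side retry in plain Java. It is not the Hadoop RetryPolicy API: StartupRetry, callWithRetry, maxRetries, and sleepMillis are hypothetical names, and matching on the exception message is an assumption made only for illustration. A real fix would more likely plug into the existing org.apache.hadoop.io.retry framework.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper; illustrates the retry idea only, not Hadoop's API. */
final class StartupRetry {
    static final String NOT_STARTED = "NameNode still not started";

    private StartupRetry() {}

    /**
     * Runs op, retrying up to maxRetries times (with a fixed sleep between
     * attempts) when it fails with an IOException whose message contains
     * NOT_STARTED. Any other exception is rethrown immediately.
     */
    static <T> T callWithRetry(Callable<T> op, int maxRetries, long sleepMillis)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                String msg = e.getMessage();
                boolean startupError = msg != null && msg.contains(NOT_STARTED);
                if (!startupError || attempt >= maxRetries) {
                    throw e; // unrelated failure, or retries exhausted
                }
                Thread.sleep(sleepMillis); // give the NameNode time to finish starting
            }
        }
    }
}
```

A caller could wrap the failing operation as StartupRetry.callWithRetry(() -> fs.mkdirs(path), 10, 100L), where fs and path are whatever the caller already has. Option 1 needs no client change at all, which is one argument for preferring it if the reordering is safe.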