zhihai xu created HDFS-7801:
-------------------------------

             Summary: "IOException:NameNode still not started" cause DFSClient operation failure without retry.
                 Key: HDFS-7801
                 URL: https://issues.apache.org/jira/browse/HDFS-7801
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs-client, namenode
            Reporter: zhihai xu

"IOException:NameNode still not started" causes DFSClient operation failure without retry.

In YARN-1778, TestFSRMStateStore failed randomly because of "java.io.IOException: NameNode still not started". The stack trace for this exception is the following:
{code}
2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)

        at org.apache.hadoop.ipc.Client.call(Client.java:1474)
        at org.apache.hadoop.ipc.Client.call(Client.java:1405)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy23.mkdirs(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:557)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
        at com.sun.proxy.$Proxy24.mkdirs(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2991)
        at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2961)
        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:973)
        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:969)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:969)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:962)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1869)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:364)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore$2.run(TestFSRMStateStore.java:273)

2015-02-03 00:09:19,089 INFO  [IPC Server handler 0 on 57792] ipc.Server (Server.java:run(2155)) - IPC Server handler 0 on 57792, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 127.0.0.1:57805 Call#14 Retry#1
java.io.IOException: NameNode still not started
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)
{code}
The reason for this random error is that the NameNode constructor [sets the started flag at the end|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L826], but it starts [NameNodeRpcServer|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L685] earlier, by calling initialize before the started flag is set.
If the client (which tries to call mkdirs) connects to the NameNode server before the started flag is set, the java.io.IOException "NameNode still not started" is thrown and the test fails.
If the client connects to the NameNode server after the started flag is set, the test succeeds.

As discussed in YARN-1778, there are two ways to fix this issue in HDFS:
1. Reorder the code in the NameNode constructor: move rpcServer.start to the end, after the started flag is set.
2. Retry in DFSClient on "IOException: NameNode still not started". We can create a new RetryPolicy that retries on this exception.

We need to discuss which of these is the correct fix, or whether a fix is needed at all if we can guarantee that the DFSClient always starts after the NameNode in the real world.
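
To make option 2 concrete, here is a rough, self-contained sketch of the client-side retry idea. It does not use any Hadoop classes: FakeNameNode, Call, callWithRetry and the "/rmstore" path are hypothetical names invented for illustration, and matching on the exception message is an assumption made only to keep the sketch small. A real fix would more likely plug into the existing RetryPolicy machinery and would have to deal with the RemoteException wrapper seen in the trace above.

```java
import java.io.IOException;

public class StartupRetrySketch {

    /** Message text taken from the stack trace in this report. */
    static final String NOT_STARTED = "NameNode still not started";

    /**
     * Hypothetical stand-in for the NameNode RPC endpoint. It accepts calls
     * before its started flag is set and rejects them, which models the
     * startup window described above. The flag effectively flips after a
     * fixed number of rejected calls so that the example is deterministic.
     */
    static class FakeNameNode {
        private final int callsBeforeStarted;
        private int rejectedCalls = 0;

        FakeNameNode(int callsBeforeStarted) {
            this.callsBeforeStarted = callsBeforeStarted;
        }

        boolean mkdirs(String path) throws IOException {
            if (rejectedCalls < callsBeforeStarted) {
                rejectedCalls++;
                throw new IOException(NOT_STARTED);
            }
            return true;
        }

        int rejectedCalls() {
            return rejectedCalls;
        }
    }

    /** Hypothetical functional interface for a single RPC attempt. */
    interface Call<T> {
        T run() throws IOException;
    }

    /**
     * Retries the call only when it fails with the "still not started"
     * message; any other IOException is rethrown immediately.
     */
    static <T> T callWithRetry(Call<T> call, int maxAttempts, long sleepMillis)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.run();
            } catch (IOException e) {
                // Assumption: string matching is used here for brevity only.
                if (!NOT_STARTED.equals(e.getMessage())) {
                    throw e;
                }
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw last; // every attempt hit the startup window
    }

    public static void main(String[] args) throws IOException {
        FakeNameNode nn = new FakeNameNode(3);
        boolean created = callWithRetry(() -> nn.mkdirs("/rmstore"), 5, 10L);
        System.out.println("mkdirs succeeded: " + created
                + ", rejected calls before success: " + nn.rejectedCalls());
    }
}
```

With option 1 (setting the started flag before rpcServer.start), the rejected calls would never be observed by the client and no retry would be needed; the sketch only shows what the client would have to tolerate if the ordering stays as it is.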