zhihai xu created HDFS-7801:
-------------------------------

             Summary: "IOException:NameNode still not started" cause DFSClient 
operation failure without retry.
                 Key: HDFS-7801
                 URL: https://issues.apache.org/jira/browse/HDFS-7801
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs-client, namenode
            Reporter: zhihai xu


The "IOException: NameNode still not started" error causes a DFSClient 
operation to fail without any retry.
In YARN-1778, TestFSRMStateStore failed intermittently because of 
"java.io.IOException: NameNode still not started".
The stack trace for this exception is:
{code}
2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)

        at org.apache.hadoop.ipc.Client.call(Client.java:1474)
        at org.apache.hadoop.ipc.Client.call(Client.java:1405)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy23.mkdirs(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:557)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
        at com.sun.proxy.$Proxy24.mkdirs(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2991)
        at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2961)
        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:973)
        at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:969)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:969)
        at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:962)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1869)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:364)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore$2.run(TestFSRMStateStore.java:273)
2015-02-03 00:09:19,089 INFO  [IPC Server handler 0 on 57792] ipc.Server (Server.java:run(2155)) - IPC Server handler 0 on 57792, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 127.0.0.1:57805 Call#14 Retry#1
java.io.IOException: NameNode still not started
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)
{code}
The reason for this intermittent failure is as follows.
The NameNode constructor [sets the started flag at the 
end|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L826].
However, it starts 
[NameNodeRpcServer|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L685]
 by calling the initialize function before the started flag is set.
If the client (which tries to call mkdirs) connects to the NameNode server 
before the started flag is set, the call fails with 
java.io.IOException: "NameNode still not started" and the test fails.
If the client connects to the NameNode server after the started flag is set, 
the test succeeds.
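
The ordering problem described above can be illustrated with a small 
stand-alone sketch. This is not the Hadoop code: ToyNameNode, openRpc, 
markStarted and tryMkdirs are hypothetical names made up for illustration, 
and only the error message is taken from the trace. It shows that a call 
arriving after the RPC endpoint is open but before the started flag is set 
gets the exception, while the same call afterwards succeeds.

```java
import java.io.IOException;

public class StartupRaceSketch {

    /** Toy stand-in for the NameNode; NOT the real Hadoop class. */
    static class ToyNameNode {
        private volatile boolean rpcOpen = false;
        private volatile boolean started = false;

        // Assumed analogue of starting the RPC server during initialization.
        void openRpc() { rpcOpen = true; }

        // Assumed analogue of setting the started flag at the end of the constructor.
        void markStarted() { started = true; }

        boolean mkdirs(String path) throws IOException {
            if (!rpcOpen) {
                throw new IOException("connection refused");
            }
            if (!started) {
                // Same message as in the stack trace above.
                throw new IOException("NameNode still not started");
            }
            return true;
        }
    }

    /** Returns "ok" on success, otherwise the exception message. */
    static String tryMkdirs(ToyNameNode nn) {
        try {
            return nn.mkdirs("/tmp/example") ? "ok" : "failed";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.openRpc();                        // RPC reachable, flag not yet set
        System.out.println(tryMkdirs(nn));   // client arrives inside the window
        nn.markStarted();                    // flag set at the very end
        System.out.println(tryMkdirs(nn));   // client arrives after the window
    }
}
```

In the real test the client thread and the NameNode constructor run 
concurrently, so which of the two outcomes occurs depends on timing.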
As discussed in YARN-1778, there are two ways to fix this issue in HDFS:
1. Reorder the code in the NameNode constructor: move rpcServer.start to the 
end, after the started flag is set.
2. Retry in DFSClient on "IOException: NameNode still not started". We can 
create a new RetryPolicy that retries on this exception.
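
A rough sketch of what option 2 could look like, written without any Hadoop 
dependency. It does not use the real org.apache.hadoop.io.retry.RetryPolicy 
interface; callWithStartupRetry, maxAttempts and sleepMillis are hypothetical 
names, and matching on the message text is a simplification chosen only to 
keep the example self-contained.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class StartupRetrySketch {

    static final String NOT_STARTED = "NameNode still not started";

    /**
     * Runs op, retrying only when it fails with an IOException whose message
     * contains NOT_STARTED. Any other failure is rethrown immediately.
     */
    static <T> T callWithStartupRetry(Callable<T> op, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                String msg = e.getMessage();
                boolean retriable = msg != null && msg.contains(NOT_STARTED);
                if (!retriable || attempt >= maxAttempts) {
                    throw e; // give up: different error, or attempts exhausted
                }
                Thread.sleep(sleepMillis); // fixed delay before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated mkdirs that fails twice while the server is still starting.
        Callable<Boolean> flakyMkdirs = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException(NOT_STARTED);
            }
            return true;
        };
        boolean ok = callWithStartupRetry(flakyMkdirs, 5, 10L);
        System.out.println(ok + " after " + calls.get() + " calls");
    }
}
```

Option 1 needs no client change at all, whereas a retry like this one also 
covers clients that reach the server during the startup window for other 
reasons.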

We need to discuss which of these is the correct fix, or whether no fix is 
needed at all, provided we can guarantee that the DFSClient always starts 
after the NameNode in real deployments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
