[ https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755075#comment-16755075 ]
Íñigo Goiri commented on HDFS-14230:
------------------------------------

OK, it looks like there's some unanimity here. The only problem I see is the time between retries: if the client retries fast enough, it may end up going through all the Routers. Is there any sleep time between retries in the regular HA client?

[~ferhui], can you switch to returning a RetriableException instead of StandbyException? I think if we just make NoNamenodesAvailableException a RetriableException this should work. I'm not sure it's fine to surface directly to the client that there were no namenodes available, but it's fine with me.

> RBF: Throw StandbyException instead of IOException when no namenodes available
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-14230
>                 URL: https://issues.apache.org/jira/browse/HDFS-14230
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.2.0, 3.1.1, 2.9.2, 3.0.3
>            Reporter: Fei Hui
>            Assignee: Fei Hui
>            Priority: Major
>         Attachments: HDFS-14230-HDFS-13891.001.patch, HDFS-14230-HDFS-13891.002.patch
>
>
> Failover usually happens when upgrading namenodes, and for some seconds there are no active namenodes; accessing HDFS through the Router fails at that moment. This can make jobs fail or hang.
> Some hive job logs are as follows:
> {code:java}
> 2019-01-03 16:12:08,337 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 133.33 sec
> MapReduce Total cumulative CPU time: 2 minutes 13 seconds 330 msec
> Ended Job = job_1542178952162_24411913
> Launching Job 4 out of 6
> Exception in thread "Thread-86" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): No namenode available under nameservice Cluster3
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.shouldRetry(RouterRpcClient.java:328)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:488)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:495)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:385)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:760)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:1152)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> Caused by:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1804)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1338)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3925)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1014)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> {code}
> Digging into the code, maybe we can throw a StandbyException when no namenodes are available. The client will then fail after some retries.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
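[Editor's sketch] The suggestion in the comment above, making NoNamenodesAvailableException a RetriableException so that retry handling can key on the exception type, can be illustrated with self-contained stand-in stubs. The real classes live in org.apache.hadoop.ipc and the RBF router code; everything below is a simplified hypothetical illustration, not the actual patch.

```java
import java.io.IOException;

// Stand-in stub: the real class is org.apache.hadoop.ipc.RetriableException.
class RetriableException extends IOException {
    RetriableException(String msg) { super(msg); }
}

// The proposed change: derive NoNamenodesAvailableException from
// RetriableException instead of a plain IOException/StandbyException,
// so a client retry policy keeps retrying rather than failing the job.
class NoNamenodesAvailableException extends RetriableException {
    NoNamenodesAvailableException(String nsId, Throwable cause) {
        super("No namenodes available under nameservice " + nsId);
        initCause(cause);
    }
}

public class RetriableSketch {
    public static void main(String[] args) {
        IOException e = new NoNamenodesAvailableException("Cluster3",
                new IOException("Operation category READ is not supported in state standby"));
        // A policy that treats RetriableException as "retry" and any other
        // IOException as "fail" would now keep retrying:
        System.out.println(e instanceof RetriableException ? "retry" : "fail");
        // prints "retry"
    }
}
```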
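[Editor's sketch] The retry-spacing concern, that a fast-retrying client could burn through all the Routers, is usually addressed by sleeping between attempts. Hadoop's actual HA retry behavior is configured via org.apache.hadoop.io.retry.RetryPolicies; the class name, method, and delay numbers below are illustrative assumptions, not Hadoop defaults.

```java
// A minimal sketch of spacing retries with a capped exponential backoff,
// so that repeated NoNamenodesAvailableException retries are not instantaneous.
public class BackoffSketch {
    // Delay before the given (0-based) retry attempt: doubles each time,
    // bounded by capMillis to avoid unbounded waits.
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(attempt, 16); // exponential growth
        return Math.min(delay, capMillis);                // but bounded
    }

    public static void main(String[] args) {
        // With a 500 ms base and a 15 s cap, successive retries would wait:
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(backoffMillis(attempt, 500, 15_000));
        }
        // prints 500, 1000, 2000, 4000, 8000, 15000
    }
}
```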