[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-06-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871597#comment-16871597
 ] 

Hudson commented on HDFS-14230:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16813 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/16813/])
HDFS-14230. RBF: Throw RetriableException instead of IOException when no namenodes available (brahma: rev 7e63e37dc5cbe330082a6a42598ffb76e0770fc1)
* (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/FederationTestUtils.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcMonitor.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/FederationRPCMetrics.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/FederationRPCPerformanceMonitor.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/router/TestRouterRPCClientRetries.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/FederationRPCMBean.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/router/TestRouterClientRejectOverload.java
* (add) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/NoNamenodesAvailableException.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcClient.java


> RBF: Throw RetriableException instead of IOException when no namenodes 
> available
> 
>
> Key: HDFS-14230
> URL: https://issues.apache.org/jira/browse/HDFS-14230
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.2.0, 3.1.1, 2.9.2, 3.0.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Fix For: HDFS-13891
>
> Attachments: HDFS-14230-HDFS-13891.001.patch, 
> HDFS-14230-HDFS-13891.002.patch, HDFS-14230-HDFS-13891.003.patch, 
> HDFS-14230-HDFS-13891.004.patch, HDFS-14230-HDFS-13891.005.patch, 
> HDFS-14230-HDFS-13891.006.patch
>
>
> Failover usually happens when upgrading namenodes, and for a few seconds there is no active namenode. Accessing HDFS through the Router fails at that moment, which can make jobs fail or hang. Some Hive job logs follow:
> {code:java}
> 2019-01-03 16:12:08,337 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 133.33 sec
> MapReduce Total cumulative CPU time: 2 minutes 13 seconds 330 msec
> Ended Job = job_1542178952162_24411913
> Launching Job 4 out of 6
> Exception in thread "Thread-86" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): No namenode available under nameservice Cluster3
> at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.shouldRetry(RouterRpcClient.java:328)
> at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:488)
> at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:495)
> at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:385)
> at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:760)
> at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:1152)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1804)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1338)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3925)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1014)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> {code}
> Digging into the code: maybe we can throw a StandbyException when no namenodes are available, so the client retries and only fails after several attempts.
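For context, the committed fix (file list above) adds a dedicated NoNamenodesAvailableException and surfaces the condition as retriable to the client. A minimal sketch of that idea, assuming the shape of the constructor and the wrapping in org.apache.hadoop.ipc.RetriableException (illustrative only, not the committed code):

{code:java}
// Sketch only: names follow the files in the commit above, but the exact
// constructor and wrapping site are assumptions, not the committed code.
package org.apache.hadoop.hdfs.server.federation.router;

import java.io.IOException;
import org.apache.hadoop.ipc.RetriableException;

/** Thrown when no namenode of a nameservice is usable (e.g. all standby). */
public class NoNamenodesAvailableException extends IOException {
  public NoNamenodesAvailableException(String nsId, IOException ioe) {
    super("No namenodes available under nameservice " + nsId, ioe);
  }
}

// In the router's RPC client, once every namenode of the nameservice has
// been tried without success, rethrow the failure as retriable (sketch):
//
//   IOException cause = new NoNamenodesAvailableException(nsId, lastException);
//   throw new RetriableException(cause);
//
// RetriableException is recognized by the HDFS client's retry policies, so
// callers back off and retry instead of failing on the first attempt.
{code}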

[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-12 Thread Brahma Reddy Battula (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766866#comment-16766866
 ] 

Brahma Reddy Battula commented on HDFS-14230:
-

My late +1.


[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-12 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766319#comment-16766319
 ] 

Íñigo Goiri commented on HDFS-14230:


Thanks [~ferhui] for the patch and [~crh] for the review.
Committed to the branch.


[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-12 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766315#comment-16766315
 ] 

Íñigo Goiri commented on HDFS-14230:


Thanks [~crh] for the review.
If nobody else has comments I'll commit today.


[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-12 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766308#comment-16766308
 ] 

CR Hota commented on HDFS-14230:


Thanks [~ferhui]. The v006 patch LGTM.

+1


[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-12 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765957#comment-16765957
 ] 

Hadoop QA commented on HDFS-14230:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 30s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
|| || || || {color:brown} HDFS-13891 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m  5s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 30s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 21s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 32s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m  9s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 55s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 35s{color} | {color:green} HDFS-13891 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 30s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 33s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 23m 38s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 26s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 79m 13s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | HDFS-14230 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12958370/HDFS-14230-HDFS-13891.006.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux d74852ed5aef 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-13891 / ecd90a6 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/26192/testReport/ |
| Max. process+thread count | 971 (vs. ulimit of 1) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26192/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-12 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765904#comment-16765904
 ] 

Fei Hui commented on HDFS-14230:


[~crh] Thanks for your comments.
Uploaded the v006 patch: removed the try/catch blocks and asserted the exceptions using ExpectedException rules.



[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-11 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765734#comment-16765734
 ] 

CR Hota commented on HDFS-14230:


[~ferhui] Thanks for working on the patch.

Can we assert the exceptions using ExpectedException rules? That makes the code easier to understand without verbose try/catch blocks; see TestConnectionManager.testGetConnectionWithException.

Other than that, LGTM.
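For reference, the suggested ExpectedException pattern looks roughly like this (a JUnit 4 sketch; the class name, rule field, and test body are illustrative, not taken from the patch):

{code:java}
import org.apache.hadoop.ipc.RetriableException;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.ExpectedException;

public class TestNoNamenodesAvailable {

  // JUnit 4 rule: declare the expected failure up front instead of
  // wrapping every call in a verbose try/catch/fail block.
  @Rule
  public ExpectedException exceptionRule = ExpectedException.none();

  @Test
  public void testRetriableWhenNoNamenodesAvailable() throws Exception {
    exceptionRule.expect(RetriableException.class);
    exceptionRule.expectMessage("No namenodes available");
    // ... invoke the router RPC that is expected to fail here ...
  }
}
{code}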

 


[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-11 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765640#comment-16765640
 ] 

Fei Hui commented on HDFS-14230:


Ping again [~elgoiri][~crh][~brahmareddy]


[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-05 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760664#comment-16760664
 ] 

Fei Hui commented on HDFS-14230:


[~elgoiri] [~crh] [~brahmareddy] Any suggestions? Thanks.


[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-01 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758534#comment-16758534
 ] 

Íñigo Goiri commented on HDFS-14230:


[^HDFS-14230-HDFS-13891.005.patch] LGTM.
+1 pending Yetus.
Does anybody else want to take a look at it?


[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-02-01 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758355#comment-16758355
 ] 

Fei Hui commented on HDFS-14230:


[~elgoiri] Could you please take a look again? Thanks.

> RBF: Throw RetriableException instead of IOException when no namenodes 
> available
> 
>
> Key: HDFS-14230
> URL: https://issues.apache.org/jira/browse/HDFS-14230
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0, 3.1.1, 2.9.2, 3.0.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14230-HDFS-13891.001.patch, 
> HDFS-14230-HDFS-13891.002.patch, HDFS-14230-HDFS-13891.003.patch, 
> HDFS-14230-HDFS-13891.004.patch, HDFS-14230-HDFS-13891.005.patch
>
>
> Failover usually happens when upgrading namenodes. And there are no active 
> namenodes within some seconds, Accessing HDFS through router fails at this 
> moment. This could make jobs  failure or hang. Some hive jobs logs are as 
> follow  
> {code:java}
> 2019-01-03 16:12:08,337 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 133.33 sec
> MapReduce Total cumulative CPU time: 2 minutes 13 seconds 330 msec
> Ended Job = job_1542178952162_24411913
> Launching Job 4 out of 6
> Exception in thread "Thread-86" java.lang.RuntimeException: 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): No namenode 
> available under nameservice Cluster3
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.shouldRetry(RouterRpcClient.java:328)
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:488)
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:495)
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:385)
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:760)
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:1152)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category READ is not supported in state standby
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1804)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1338)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3925)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1014)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> {code}
> Digging into the code, maybe we can throw a StandbyException when no 
> namenodes are available; the client will then only fail after some retries.
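
A minimal sketch of that direction (the issue title eventually settled on 
RetriableException rather than StandbyException): wrap the terminal "no 
namenodes" failure so the client's RPC retry policy keeps retrying instead of 
surfacing a fatal error, as in the log above. The helper name and wrap point 
below are assumptions for illustration, not the committed patch.

{code:java}
import java.io.IOException;
import org.apache.hadoop.ipc.RetriableException;

// Hypothetical wrap point in the router's RPC client, reached once every
// namenode of the target nameservice has been tried without success.
private void throwNoNamenodeAvailable(String nsId, IOException lastExc)
    throws RetriableException {
  IOException cause = new IOException(
      "No namenode available under nameservice " + nsId, lastExc);
  // A RetriableException tells the DFS client to retry the call rather
  // than fail the job outright.
  throw new RetriableException(cause);
}
{code}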



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-01-31 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757950#comment-16757950
 ] 

Hadoop QA commented on HDFS-14230:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
51s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} HDFS-13891 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
13s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
31s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
41s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 53s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
8s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} HDFS-13891 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 21s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m  
7s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 89m 13s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | HDFS-14230 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12957187/HDFS-14230-HDFS-13891.005.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 136d322b8659 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-13891 / d37590b |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26107/testReport/ |
| Max. process+thread count | 1077 (vs. ulimit of 1) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26107/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> RBF: Throw RetriableException instead of IOException when no namenodes 
> available
> 

[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-01-31 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757903#comment-16757903
 ] 

Fei Hui commented on HDFS-14230:


Uploaded v005 patch:
* Fix checkstyle
* Fix TestRouterClientRejectOverload#testNoNamenodesAvailable by updating the 
namenode heartbeat
* Fix TestRouterRPCClientRetries#testRetryWhenAllNameServiceDown by changing 
the expected string (see the sketch below)
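
The expected-string change presumably moves the assertion from the old plain 
IOException text to the new retriable wrapper's text; a hedged sketch (the 
{{routerFs}} fixture and the exact message strings are assumptions based on 
the discussion, not the patch itself):

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.ipc.RemoteException;
import static org.apache.hadoop.test.LambdaTestUtils.intercept;

// Before: with every nameservice down, the router surfaced a plain
// IOException with this message. intercept(...) asserts the callable
// throws the given type with the given text somewhere in its message.
intercept(RemoteException.class, "No namenode available",
    () -> routerFs.getFileStatus(new Path("/")));

// After: the failure is wrapped as retriable, so the expected string
// in the test changes accordingly.
intercept(RemoteException.class, "No namenodes available",
    () -> routerFs.getFileStatus(new Path("/")));
{code}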

> RBF: Throw RetriableException instead of IOException when no namenodes 
> available
> 

[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-01-31 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757096#comment-16757096
 ] 

Fei Hui commented on HDFS-14230:


Checking the failed tests.

> RBF: Throw RetriableException instead of IOException when no namenodes 
> available
> 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-01-31 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757052#comment-16757052
 ] 

Hadoop QA commented on HDFS-14230:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} HDFS-13891 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
14s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
28s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
32s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 58s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
50s{color} | {color:green} HDFS-13891 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} HDFS-13891 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 16s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs-rbf: The patch 
generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 23s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 23m 19s{color} 
| {color:red} hadoop-hdfs-rbf in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 77m 25s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.federation.router.TestRouterClientRejectOverload |
|   | hadoop.hdfs.server.federation.router.TestRouterRPCClientRetries |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | HDFS-14230 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12957023/HDFS-14230-HDFS-13891.004.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux cea0e2b40729 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-13891 / d37590b |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26099/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26099/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt
 |

[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-01-31 Thread Fei Hui (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756991#comment-16756991
 ] 

Fei Hui commented on HDFS-14230:


[~elgoiri] Thanks for your detailed proposal.
Uploaded v004 patch.

> RBF: Throw RetriableException instead of IOException when no namenodes 
> available
> 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HDFS-14230) RBF: Throw RetriableException instead of IOException when no namenodes available

2019-01-30 Thread JIRA


[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756387#comment-16756387
 ] 

Íñigo Goiri commented on HDFS-14230:


Thanks [~ferhui] for switching to retriable.
Some comments:
* Make NoNamenodesAvailableException have a constructor that only takes the 
nsId and the original ioe. That way you can do {{new 
NoNamenodesAvailableException(nsId, ioe);}} and generate the message inside 
(a sketch of this and the next point follows the list).
* I think you could just create the {{RetriableException}} from it: {{new 
RetriableException(ioe);}}
* Add a counter to track this failure.
* Add a javadoc to the new test explaining that you are simulating a failover.
* I think the mini DFS cluster already offers ways to do the transition to 
active/standby.
* Can you make the code you are using to switch the NN to active into a 
function? Then you can probably use some lambda fanciness to start the 
function in a less verbose way. BTW, I am not a big fan of sleeping 10 
seconds; we may want to do something smarter. You could start with all 
namenodes standby, check the counters, put one as active, and check that the 
call succeeds.
* Check for the new counter added according to my previous comment.
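
A hedged sketch of the first two points, as referenced above; the class shape 
is inferred from this review, not copied from the committed code:

{code:java}
package org.apache.hadoop.hdfs.server.federation.router;

import java.io.IOException;

/**
 * Thrown when no namenode of a nameservice can serve the request.
 * The message is generated inside from the nsId, and the original
 * exception is kept as the cause.
 */
public class NoNamenodesAvailableException extends IOException {
  public NoNamenodesAvailableException(String nsId, IOException ioe) {
    super("No namenodes available under nameservice " + nsId, ioe);
  }
}
{code}

At the call site, per the second point, the router can then simply do 
{{throw new RetriableException(ioe);}} with the new exception as the cause.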

> RBF: Throw RetriableException instead of IOException when no namenodes 
> available
> 