[
https://issues.apache.org/jira/browse/HDFS-5555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837092#comment-13837092
]
Jimmy Xiang commented on HDFS-5555:
-----------------------------------
DFSClient uses RetryInvocationHandler in this case, while the iterator uses
ProtobufRpcEngine#Invoker, which doesn't handle failover. One fix is to throw
StandbyException at the beginning instead of returning an iterator. The other
fix is to make the iterator support failover as well. Even if we throw
StandbyException at the beginning, the issue will come up again if the NN
fails over in the middle of iterating.
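
For illustration, below is a minimal, self-contained sketch of the two call
paths (the names are simplified stand-ins, not the real Hadoop classes): a
plain proxy surfaces the StandbyException straight to the caller, while a
RetryInvocationHandler-style dynamic proxy catches it, fails over to the
other NN, and retries the request.
{code}
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;

/**
 * Toy model of the two RPC paths. Names are simplified stand-ins, not
 * Hadoop's real classes: NamenodeProtocol plays the role of the client
 * protocol, and FailoverHandler mimics what RetryInvocationHandler does
 * for DFSClient calls.
 */
public class FailoverIteratorSketch {

  static class StandbyException extends RuntimeException {}

  interface NamenodeProtocol {
    List<String> listCacheDirectives(int prevId);
  }

  /** A mock "namenode" that rejects reads while in standby state. */
  static class MockNamenode implements NamenodeProtocol {
    final boolean standby;
    MockNamenode(boolean standby) { this.standby = standby; }
    public List<String> listCacheDirectives(int prevId) {
      if (standby) throw new StandbyException();
      return prevId == 0 ? List.of("dir-1", "dir-2") : List.of();
    }
  }

  /**
   * Mimics RetryInvocationHandler: on StandbyException, retry the same
   * method against the next namenode instead of failing the caller.
   */
  static class FailoverHandler implements InvocationHandler {
    final NamenodeProtocol[] targets;
    int current = 0;
    FailoverHandler(NamenodeProtocol... targets) { this.targets = targets; }
    public Object invoke(Object proxy, Method m, Object[] args) throws Throwable {
      for (int attempt = 0; attempt < targets.length; attempt++) {
        try {
          return m.invoke(targets[current], args);
        } catch (Exception e) {
          if (rootCause(e) instanceof StandbyException) {
            current = (current + 1) % targets.length;  // fail over, retry
          } else {
            throw e;
          }
        }
      }
      throw new StandbyException();  // every configured NN was standby
    }
    static Throwable rootCause(Throwable t) {
      while (t.getCause() != null) t = t.getCause();
      return t;
    }
  }

  public static void main(String[] args) {
    NamenodeProtocol nn1 = new MockNamenode(true);   // first-listed NN: standby
    NamenodeProtocol nn2 = new MockNamenode(false);  // second NN: active

    // Path 1: talk to nn1 directly (ProtobufRpcEngine#Invoker style) --
    // the StandbyException leaks to the caller, as in this bug.
    try {
      nn1.listCacheDirectives(0);
    } catch (StandbyException e) {
      System.out.println("direct proxy: StandbyException reaches the caller");
    }

    // Path 2 (second proposed fix): hand the iterator a failover-capable
    // proxy so each batched request can retry against the active NN.
    NamenodeProtocol failoverProxy = (NamenodeProtocol) Proxy.newProxyInstance(
        NamenodeProtocol.class.getClassLoader(),
        new Class<?>[] { NamenodeProtocol.class },
        new FailoverHandler(nn1, nn2));
    System.out.println("failover proxy: " + failoverProxy.listCacheDirectives(0));
  }
}
{code}
The second path corresponds to the second fix above: per the stack trace,
CacheEntriesIterator#makeRequest currently goes through the raw
ProtobufRpcEngine#Invoker, so it would instead need to issue its calls
through a failover-capable proxy.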
> CacheAdmin commands fail when first listed NameNode is in Standby
> -----------------------------------------------------------------
>
> Key: HDFS-5555
> URL: https://issues.apache.org/jira/browse/HDFS-5555
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: caching
> Affects Versions: 3.0.0
> Reporter: Stephen Chu
> Assignee: Jimmy Xiang
>
> I am on a HA-enabled cluster. The NameNodes are on host-1 and host-2.
> In the configurations, we specify the host-1 NN first and the host-2 NN
> afterwards in the _dfs.ha.namenodes.ns1_ property (where _ns1_ is the name of
> the nameservice).
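> For reference, a minimal HA configuration of this shape (hostnames and
> ports here are illustrative placeholders) would look like:
> {code}
> <property>
>   <name>dfs.nameservices</name>
>   <value>ns1</value>
> </property>
> <property>
>   <!-- nn1 (host-1) is listed first, so clients try it before nn2 -->
>   <name>dfs.ha.namenodes.ns1</name>
>   <value>nn1,nn2</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.ns1.nn1</name>
>   <value>host-1:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.ns1.nn2</name>
>   <value>host-2:8020</value>
> </property>
> {code}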
> If the host-1 NN is Standby and the host-2 NN is Active, some CacheAdmin
> commands will fail, complaining that the operation is not supported in
> standby state.
> e.g.
> {code}
> bash-4.1$ hdfs cacheadmin -removeDirectives -path /user/hdfs2
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1501)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1082)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.listCacheDirectives(FSNamesystem.java:6892)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer$ServerSideCacheEntriesIterator.makeRequest(NameNodeRpcServer.java:1263)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer$ServerSideCacheEntriesIterator.makeRequest(NameNodeRpcServer.java:1249)
> at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequest(BatchedRemoteIterator.java:77)
> at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequestIfNeeded(BatchedRemoteIterator.java:85)
> at org.apache.hadoop.fs.BatchedRemoteIterator.hasNext(BatchedRemoteIterator.java:99)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.listCacheDirectives(ClientNamenodeProtocolServerSideTranslatorPB.java:1087)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1499)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> at org.apache.hadoop.ipc.Client.call(Client.java:1348)
> at org.apache.hadoop.ipc.Client.call(Client.java:1301)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy9.listCacheDirectives(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB$CacheEntriesIterator.makeRequest(ClientNamenodeProtocolTranslatorPB.java:1079)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB$CacheEntriesIterator.makeRequest(ClientNamenodeProtocolTranslatorPB.java:1064)
> at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequest(BatchedRemoteIterator.java:77)
> at org.apache.hadoop.fs.BatchedRemoteIterator.makeRequestIfNeeded(BatchedRemoteIterator.java:85)
> at org.apache.hadoop.fs.BatchedRemoteIterator.hasNext(BatchedRemoteIterator.java:99)
> at org.apache.hadoop.hdfs.DistributedFileSystem$32.hasNext(DistributedFileSystem.java:1704)
> at org.apache.hadoop.hdfs.tools.CacheAdmin$RemoveCacheDirectiveInfosCommand.run(CacheAdmin.java:372)
> at org.apache.hadoop.hdfs.tools.CacheAdmin.run(CacheAdmin.java:84)
> at org.apache.hadoop.hdfs.tools.CacheAdmin.main(CacheAdmin.java:89)
> {code}
> After manually failing over from host-2 to host-1, the CacheAdmin commands
> succeed.
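> (i.e., via something like the following, assuming the namenode IDs in
> _dfs.ha.namenodes.ns1_ are nn1 for host-1 and nn2 for host-2)
> {code}
> hdfs haadmin -failover nn2 nn1
> {code}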
> The affected commands are:
> -listPools
> -listDirectives
> -removeDirectives
--
This message was sent by Atlassian JIRA
(v6.1#6144)