[ https://issues.apache.org/jira/browse/HDFS-15555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188986#comment-17188986 ]
Akira Ajisaka commented on HDFS-15555:
--------------------------------------
The following code refreshes the cache:
https://github.com/apache/hadoop/blob/b6a3286d27b604322fddc1ec06ad563fd8a9d0f4/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcClient.java#L424-L428
{{failover}} is set to true when the IOException is one of the "unavailable"
exceptions:
https://github.com/apache/hadoop/blob/b6a3286d27b604322fddc1ec06ad563fd8a9d0f4/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcClient.java#L441-L445
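The rule described above can be sketched as follows. This is only an illustration, not the actual RouterRpcClient code: the class FailoverSketch, the method shouldFailover, and the local StandbyException stand-in are hypothetical names, and the set of exceptions treated as "unavailable" is reduced to that single stand-in for brevity. It shows why a plain SocketException does not set the failover flag under such a check.

```java
import java.io.IOException;
import java.net.SocketException;

class FailoverSketch {
  /** Hypothetical stand-in for Hadoop's StandbyException (not the real class). */
  static class StandbyException extends IOException {
    StandbyException(String msg) {
      super(msg);
    }
  }

  /** Assumed classification: only the stand-in StandbyException is "unavailable". */
  static boolean isUnavailableException(IOException ioe) {
    return ioe instanceof StandbyException;
  }

  /** The failover flag is set only for exceptions classified as "unavailable". */
  static boolean shouldFailover(IOException ioe) {
    return isUnavailableException(ioe);
  }

  public static void main(String[] args) {
    // The standby case triggers failover; a plain socket error does not.
    System.out.println(shouldFailover(new StandbyException("standby")));   // true
    System.out.println(shouldFailover(new SocketException("conn reset"))); // false
  }
}
```

With this kind of check, any exception type missing from the classification never reaches the code path that refreshes the cache.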
> RBF: Refresh cacheNS when SocketException occurs
> ------------------------------------------------
>
> Key: HDFS-15555
> URL: https://issues.apache.org/jira/browse/HDFS-15555
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: rbf
> Environment: HDFS 3.3.0, Java 11
> Reporter: Akira Ajisaka
> Assignee: Akira Ajisaka
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Problem:
> When the active NameNode is restarted and is loading the fsimage, DFSRouters
> slow down significantly.
> Investigation:
> When the active NameNode is restarted and is loading the fsimage,
> RouterRpcClient receives a SocketException. Because
> RouterRpcClient#isUnavailableException(IOException) returns false for a
> SocketException, MembershipNamenodeResolver#cacheNS is not refreshed. As a
> result, the order of the NameNodes returned by
> MembershipNamenodeResolver#getNamenodesForNameserviceId(String) is unchanged
> and the active NameNode is still returned first, so RouterRpcClient keeps
> trying to connect to the NameNode that is still loading the fsimage.
> Once the fsimage is loaded, the NameNode throws StandbyException. Because
> that exception is one of the "unavailable" exceptions, cacheNS is refreshed.
> Workaround:
> Stop the NameNode and wait 1 minute before starting it again, instead of
> restarting it.
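
The effect described under "Investigation" can be reproduced with a toy model. This is a rough sketch under stated assumptions, not the MembershipNamenodeResolver implementation: CacheOrderSketch, onFailure, the treatSocketExceptionAsUnavailable flag, the NameNode ids "nn1"/"nn2", and the local StandbyException stand-in are all hypothetical, and "refreshing the cache" is modelled simply as moving the failed NameNode to the end of the list.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.List;

class CacheOrderSketch {
  /** Hypothetical stand-in for Hadoop's StandbyException (not the real class). */
  static class StandbyException extends IOException {
    StandbyException(String msg) {
      super(msg);
    }
  }

  // Hypothetical cached ordering for one nameservice; index 0 is tried first.
  private final List<String> cachedOrder = new ArrayList<>(List.of("nn1", "nn2"));

  /** Returns a snapshot of the cached ordering. */
  List<String> getNamenodes() {
    return List.copyOf(cachedOrder);
  }

  /** Handles a failed call to the first NameNode; reorders only on failover. */
  void onFailure(IOException ioe, boolean treatSocketExceptionAsUnavailable) {
    boolean failover = ioe instanceof StandbyException
        || (treatSocketExceptionAsUnavailable && ioe instanceof SocketException);
    if (failover) {
      // Stand-in for a cache refresh: demote the NameNode that just failed.
      cachedOrder.add(cachedOrder.remove(0));
    }
  }

  public static void main(String[] args) {
    CacheOrderSketch resolver = new CacheOrderSketch();
    // While the fsimage is loading: SocketException, ordering stays the same.
    resolver.onFailure(new SocketException("connection reset"), false);
    System.out.println(resolver.getNamenodes()); // [nn1, nn2]
    // After loading: StandbyException, the ordering finally changes.
    resolver.onFailure(new StandbyException("standby"), false);
    System.out.println(resolver.getNamenodes()); // [nn2, nn1]
  }
}
```

Passing true for the hypothetical flag models the behaviour the issue title asks for: the ordering changes on the first SocketException instead of only after the StandbyException.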