[ https://issues.apache.org/jira/browse/HDFS-15555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840305#comment-17840305 ]
chuanjie.duan commented on HDFS-15555:
--------------------------------------

[~elgoiri] [~aajisaka] I'm not sure why "ioe instanceof ConnectException" was deleted.

> RBF: Refresh cacheNS when SocketException occurs
> ------------------------------------------------
>
>                 Key: HDFS-15555
>                 URL: https://issues.apache.org/jira/browse/HDFS-15555
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: rbf
>    Affects Versions: 3.3.1, 3.4.0
>         Environment: HDFS 3.3.0, Java 11
>            Reporter: Akira Ajisaka
>            Assignee: Akira Ajisaka
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.3.1, 3.4.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Problem:
> When the active NameNode is restarted and is loading the fsimage, DFSRouters
> slow down significantly.
> Investigation:
> While the restarted active NameNode is loading the fsimage, RouterRpcClient
> receives a SocketException. Because
> RouterRpcClient#isUnavailableException(IOException) returns false when the
> argument is a SocketException, MembershipNameNodeResolver#cacheNS is not
> refreshed. As a result, the order of the NameNodes returned by
> MembershipNameNodeResolver#getNamenodesForNameserviceId(String) is unchanged
> and the active NameNode is still returned first, so RouterRpcClient
> keeps trying to connect to the NameNode that is loading the fsimage.
> Once the fsimage is loaded, the NameNode throws StandbyException. That
> exception is treated as an "unavailable" exception, so cacheNS is refreshed.
> Workaround:
> Stop the NameNode and wait 1 minute before starting it again, instead of
> restarting it in one step.
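
For readers following the question about the removed "ioe instanceof ConnectException" check, here is a minimal, self-contained sketch of the exception classification described in the investigation. It is not the actual RouterRpcClient code: the class UnavailableCheckSketch, both method names, and StandbyLikeException (a local stand-in for Hadoop's StandbyException) are hypothetical, and the "narrow" variant is only an assumption about what a ConnectException-based check might look like. The one fact it relies on is that java.net.ConnectException is a subclass of java.net.SocketException in the JDK, so a SocketException check also matches ConnectException. Whether that is the maintainers' reason for the deletion is not stated in this thread.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketException;

/** Illustrative sketch only; not the real RouterRpcClient implementation. */
final class UnavailableCheckSketch {

    /** Hypothetical stand-in for Hadoop's StandbyException so the sketch compiles alone. */
    static class StandbyLikeException extends IOException {
        StandbyLikeException(String msg) {
            super(msg);
        }
    }

    /** Assumed narrow check: a plain SocketException is NOT treated as unavailable. */
    static boolean isUnavailableNarrow(IOException ioe) {
        return ioe instanceof ConnectException
            || ioe instanceof StandbyLikeException;
    }

    /** Assumed broad check: SocketException and all its subclasses are unavailable. */
    static boolean isUnavailableBroad(IOException ioe) {
        return ioe instanceof SocketException
            || ioe instanceof StandbyLikeException;
    }

    public static void main(String[] args) {
        IOException reset = new SocketException("Connection reset");
        IOException refused = new ConnectException("Connection refused");
        IOException standby = new StandbyLikeException("NameNode is in standby state");

        // A plain SocketException slips past the narrow check, so a cache
        // refresh keyed on this result would not be triggered.
        System.out.println("narrow, reset:   " + isUnavailableNarrow(reset));   // false
        System.out.println("narrow, refused: " + isUnavailableNarrow(refused)); // true

        // The broad check matches both, because ConnectException extends SocketException.
        System.out.println("broad, reset:    " + isUnavailableBroad(reset));    // true
        System.out.println("broad, refused:  " + isUnavailableBroad(refused));  // true
        System.out.println("broad, standby:  " + isUnavailableBroad(standby));  // true
    }
}
```

Under these assumptions, replacing a ConnectException check with a SocketException check widens the set of exceptions that count as "unavailable" without dropping the connection-refused case.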