[ https://issues.apache.org/jira/browse/HDFS-13119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356614#comment-16356614 ]
Yiqun Lin edited comment on HDFS-13119 at 2/8/18 8:05 AM:
----------------------------------------------------------

Just looked into this.
{quote}When a federated cluster has one of the subcluster down, operations that run in every subcluster (RouterRpcClient#invokeAll()) may take all the RPC connections.
{quote}
I looked into the related code and did not see any logic that triggers RPC requests to every subcluster once one subcluster is down. I only looked at the method {{RouterRpcClient#invoke}}, which is called from {{RouterRpcClient#invokeMethod}}. Correct me if I am wrong.
{quote}
Better control of the number of RPC clients
{quote}
This one is not so clear to me. Do you mean we could have a maximum RPC queue size on the Router RPC server side?

I have a proposal for "No need to try so many times if we "know" the subcluster is down": when a failure happens, query {{ActiveNamenodeResolver}} to check whether the cluster is down; if it is, do not retry. In addition, the current default number of retries (10) can be decreased a lot.

was (Author: linyiqun):
Just looked into this.
{quote}When a federated cluster has one of the subcluster down, operations that run in every subcluster (RouterRpcClient#invokeAll()) may take all the RPC connections.
{quote}
I looked into the related code and did not see any logic that triggers RPC requests to every subcluster once one subcluster is down. I only looked at the method {{RouterRpcClient#invoke}}, which is called from {{RouterRpcClient#invokeMethod}}. Correct me if I am wrong.

This one is not so clear to me; would you describe it in more detail?
{quote}
Better control of the number of RPC clients
{quote}
I have a proposal for "No need to try so many times if we "know" the subcluster is down": when a failure happens, query {{ActiveNamenodeResolver}} to check whether the cluster is down; if it is, do not retry. In addition, the current default number of retries (10) can be decreased a lot.
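The retry proposal in the comment above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the Router's actual code: {{SubclusterHealth}}, {{DownAwareRetry}} and {{invokeWithRetry}} are hypothetical names, and {{SubclusterHealth#isDown}} merely stands in for whatever lookup {{ActiveNamenodeResolver}} would provide.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch: stop retrying once the subcluster is known to be down.
class DownAwareRetry {

    /** Assumed stand-in for a membership lookup; NOT the real ActiveNamenodeResolver API. */
    interface SubclusterHealth {
        boolean isDown(String nameserviceId);
    }

    /**
     * Runs the call once, then retries up to maxRetries more times on IOException.
     * After each failure it asks the health lookup whether the subcluster is down
     * and, if so, gives up immediately instead of using the remaining retries.
     */
    static <T> T invokeWithRetry(String nameserviceId, Callable<T> call,
            SubclusterHealth health, int maxRetries) throws IOException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e;
                if (health.isDown(nameserviceId)) {
                    break; // subcluster is known to be down: further retries are pointless
                }
            } catch (Exception e) {
                // Non-IO failures are not retried in this sketch.
                throw new IOException(e);
            }
        }
        throw last;
    }
}
```

With a lookup that reports the subcluster as down, a failing call is attempted exactly once; with one that reports it as up, the call is attempted {{maxRetries + 1}} times before the last exception is rethrown. A smaller default than 10 would then only matter for failures where the subcluster is still believed to be up.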
> RBF: Manage unavailable clusters
> --------------------------------
>
>                 Key: HDFS-13119
>                 URL: https://issues.apache.org/jira/browse/HDFS-13119
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Íñigo Goiri
>            Assignee: Yiqun Lin
>            Priority: Major
>
> When a federated cluster has one of the subcluster down, operations that run
> in every subcluster ({{RouterRpcClient#invokeAll()}}) may take all the RPC
> connections.