Íñigo Goiri commented on HDFS-13119:

The unit tests in [^HDFS-13119.002.patch] look good.
Just to summarize, the current solution has two parts:
# Limit the number of threads for parallel operations that use 
{{RouterRpcClient#invokeConcurrent()}}; with this we should avoid ending up 
with a huge thread pool.
# When we fail to invoke a method in a subcluster, we check whether it is 
unavailable according to our membership; if it is, we try once more, 
otherwise we fail.
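
As a rough illustration of the first part, concurrent calls can be funneled through a fixed-size pool so that a slow or unavailable subcluster cannot make the number of threads grow without bound. This is only a sketch and not the code in the patch: the class {{BoundedConcurrentInvoker}}, its constructor parameter {{maxThreads}} and its method signatures are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper; not the actual RouterRpcClient implementation. */
final class BoundedConcurrentInvoker {
  // Fixed-size pool: at most maxThreads calls run at once, the rest queue up.
  private final ExecutorService executor;

  BoundedConcurrentInvoker(int maxThreads) {
    this.executor = Executors.newFixedThreadPool(maxThreads);
  }

  /** Runs all calls on the bounded pool and returns results in input order. */
  <T> List<T> invokeConcurrent(List<Callable<T>> calls) throws Exception {
    // invokeAll blocks until every call has completed or failed.
    List<Future<T>> futures = executor.invokeAll(calls);
    List<T> results = new ArrayList<>();
    for (Future<T> future : futures) {
      // get() rethrows a failure of the call as an ExecutionException.
      results.add(future.get());
    }
    return results;
  }

  void shutdown() {
    executor.shutdown();
  }
}
```

With a pool of this kind, a call that hangs on a down subcluster occupies one of the {{maxThreads}} workers instead of causing a new thread to be created.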

For #1, we have a separate pool for the connections. [~chris.douglas] helped 
review that part. Any thoughts on this solution?
For #2, the code is rather complicated just to allow one retry. It might be OK 
to just check whether the cluster is unavailable and, if so, throw the 
exception without retrying. Otherwise, we could just do:
if (isClusterUnAvailable(nsId) && retryCount > 0) {
  throw new IOException("No namenode available under nameservice " + nsId, ioe);
}
Then, the default logic takes care of the first retry.
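
To make the intent of that check concrete, here is a self-contained sketch. Only the condition and the exception message come from the snippet above; the class {{UnavailableClusterCheck}}, the method {{failIfUnavailableAfterRetry}} and the set-based membership view are assumptions made for illustration.

```java
import java.io.IOException;
import java.util.Set;

/** Hypothetical stand-in for the retry check; not the code in the patch. */
final class UnavailableClusterCheck {
  // Assumed membership view: nameservices currently known to be unavailable.
  private final Set<String> unavailableNameservices;

  UnavailableClusterCheck(Set<String> unavailableNameservices) {
    this.unavailableNameservices = unavailableNameservices;
  }

  boolean isClusterUnAvailable(String nsId) {
    return unavailableNameservices.contains(nsId);
  }

  /**
   * Gives up only when the cluster is unavailable and at least one retry
   * has already happened; otherwise returns so the caller's default retry
   * logic can proceed.
   */
  void failIfUnavailableAfterRetry(String nsId, int retryCount, IOException ioe)
      throws IOException {
    if (isClusterUnAvailable(nsId) && retryCount > 0) {
      throw new IOException(
          "No namenode available under nameservice " + nsId, ioe);
    }
  }
}
```

On the first failure ({{retryCount == 0}}) the method returns normally, so the existing retry path gets its one attempt; on the next failure against the same unavailable nameservice it throws, with the original exception kept as the cause.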

> RBF: Manage unavailable clusters
> --------------------------------
>                 Key: HDFS-13119
>                 URL: https://issues.apache.org/jira/browse/HDFS-13119
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Íñigo Goiri
>            Assignee: Yiqun Lin
>            Priority: Major
>         Attachments: HDFS-13119.001.patch, HDFS-13119.002.patch
> When a federated cluster has one of its subclusters down, operations that run 
> in every subcluster ({{RouterRpcClient#invokeAll()}}) may take all the RPC 
> connections.
