[jira] [Work logged] (HDFS-15419) RBF: Router should retry communicate with NN when cluster is unavailable using configurable time interval

ASF GitHub Bot (Jira) Wed, 16 Sep 2020 13:20:19 -0700


     [ https://issues.apache.org/jira/browse/HDFS-15419?focusedWorklogId=485359&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-485359 ]


ASF GitHub Bot logged work on HDFS-15419:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Sep/20 20:19
            Start Date: 16/Sep/20 20:19
    Worklog Time Spent: 10m 
      Work Description: goiri commented on a change in pull request #2082:
URL: https://github.com/apache/hadoop/pull/2082#discussion_r489728490



##########
File path: hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/HdfsClientConfigKeys.java
##########
@@ -250,6 +250,11 @@
   String DFS_LEASE_HARDLIMIT_KEY = "dfs.namenode.lease-hard-limit-sec";
   long DFS_LEASE_HARDLIMIT_DEFAULT = 20 * 60;
 
+  String DFS_ROUTER_RPC_RETRY_INTERVAL_KEY = "dfs.router.rpc.retry.interval.seconds";
+  int DFS_ROUTER_RPC_RETRY_INTERVAL_DEFAULT = 10;

Review comment:
       Make it TimeUnit.SECONDS.toXXXX(10)

##########
File path: hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/HdfsClientConfigKeys.java
##########
@@ -250,6 +250,11 @@
   String DFS_LEASE_HARDLIMIT_KEY = "dfs.namenode.lease-hard-limit-sec";
   long DFS_LEASE_HARDLIMIT_DEFAULT = 20 * 60;
 
+  String DFS_ROUTER_RPC_RETRY_INTERVAL_KEY = "dfs.router.rpc.retry.interval.seconds";

Review comment:
       Technically this is not seconds but a time duration.
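
The two review comments above can be illustrated with a minimal, self-contained sketch. It is not the code from the pull request: the class name, the key without the ".seconds" suffix, the choice of milliseconds as the internal unit, and the helper parseDurationMillis are all assumptions made for illustration. In Hadoop itself a value like this would more likely be read with Configuration.getTimeDuration(name, defaultValue, unit), which accepts unit suffixes; the hand-written parser below only mimics that idea so the example runs without Hadoop on the classpath.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

public class RetryIntervalSketch {
  // Hypothetical key: no ".seconds" suffix, since the value is a duration.
  static final String RETRY_INTERVAL_KEY = "dfs.router.rpc.retry.interval";

  // Default written through TimeUnit so the unit is explicit.
  // Milliseconds as the target unit is an assumption of this sketch.
  static final long RETRY_INTERVAL_DEFAULT_MS = TimeUnit.SECONDS.toMillis(10);

  /**
   * Parses a duration such as "500ms", "10s" or "2m" into milliseconds.
   * A bare number is interpreted as seconds (an assumption of this sketch).
   */
  static long parseDurationMillis(String value) {
    String v = value.trim().toLowerCase(Locale.ROOT);
    if (v.endsWith("ms")) {
      return Long.parseLong(v.substring(0, v.length() - 2).trim());
    }
    if (v.endsWith("s")) {
      long n = Long.parseLong(v.substring(0, v.length() - 1).trim());
      return TimeUnit.SECONDS.toMillis(n);
    }
    if (v.endsWith("m")) {
      long n = Long.parseLong(v.substring(0, v.length() - 1).trim());
      return TimeUnit.MINUTES.toMillis(n);
    }
    return TimeUnit.SECONDS.toMillis(Long.parseLong(v));
  }

  public static void main(String[] args) {
    System.out.println(RETRY_INTERVAL_KEY + " default (ms): "
        + RETRY_INTERVAL_DEFAULT_MS);
    System.out.println("30s in ms: " + parseDurationMillis("30s"));
  }
}
```

Keeping the unit out of the key name lets an operator write "10s" or "500ms" without the key and the value contradicting each other.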

##########
File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcClient.java
##########
@@ -557,6 +567,19 @@ private Object invoke(String nsId, int retryCount, final Method method,
           }
 
           // retry
+          try {

Review comment:
       Sure, but we should change the logic to match the old style or at least 
to not have the two styles together.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 485359)
    Time Spent: 20m  (was: 10m)

> RBF: Router should retry communicate with NN when cluster is unavailable 
> using configurable time interval
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15419
>                 URL: https://issues.apache.org/jira/browse/HDFS-15419
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: configuration, hdfs-client, rbf
>            Reporter: bhji123
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When the cluster is unavailable, router -> namenode communication retries 
> only once, with no time interval between attempts, which is not reasonable.
> For example, my company runs several HDFS clusters with more than 
> 1000 nodes, and we have encountered this problem. In some cases a cluster 
> becomes unavailable briefly, for about 10 or 30 seconds; during that time 
> almost all RPC requests to the router fail because the router retries only 
> once, with no time interval.
> It would be better to enhance the router retry strategy so that it retries 
> communication with the NN using a configurable time interval and a maximum 
> number of retries.
>  
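
The retry strategy requested in the description (a configurable interval plus a maximum number of retries) could look roughly like the sketch below. It is a generic illustration, not the RouterRpcClient change under review: the class name, the method callWithRetry, and its parameters are hypothetical, and it retries on any exception, whereas real router code would retry only on errors that indicate an unavailable namenode.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryWithIntervalSketch {

  /**
   * Runs the call once, then retries up to maxRetries more times,
   * sleeping intervalMs between attempts. Rethrows the last failure
   * when every attempt has failed.
   */
  static <T> T callWithRetry(Callable<T> call, int maxRetries,
      long intervalMs) throws Exception {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    Exception last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        // Simplification: every exception is treated as retriable.
        last = e;
        if (attempt < maxRetries) {
          TimeUnit.MILLISECONDS.sleep(intervalMs);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    // Simulated backend that is unavailable for the first two calls.
    AtomicInteger calls = new AtomicInteger();
    String result = callWithRetry(() -> {
      if (calls.incrementAndGet() < 3) {
        throw new IOException("cluster unavailable");
      }
      return "ok";
    }, 5, 100L);
    System.out.println(result + " after " + calls.get() + " calls");
  }
}
```

With an interval of 10 seconds and a handful of retries, a brief outage of the kind described (10 or 30 seconds) would be covered by the retries rather than surfacing as a failure to the caller.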



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

