[jira] [Commented] (HADOOP-10597) Evaluate if we can have RPC client back off when server is under heavy load
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508048#comment-14508048 ] Arpit Agarwal commented on HADOOP-10597: +1 for the updated patch. Reviewed the test case and it looks like Steve's comment was addressed. Thanks Ming. I'll commit this by tomorrow. Evaluate if we can have RPC client back off when server is under heavy load --- Key: HADOOP-10597 URL: https://issues.apache.org/jira/browse/HADOOP-10597 Project: Hadoop Common Issue Type: Sub-task Reporter: Ming Ma Assignee: Ming Ma Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, HADOOP-10597.patch, MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf Currently, if an application hits the NN too hard, RPC requests will sit in a blocking state, assuming the OS doesn't run out of connections. Alternatively, RPC or the NN can throw a well-defined exception back to the client, based on certain policies, when it is under heavy load; the client will understand such an exception and do exponential back-off, as another implementation of RetryInvocationHandler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
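The client-side behaviour described in the issue summary can be sketched as a simple retry loop with doubling sleep times. This is a minimal illustration, not the Hadoop RetryInvocationHandler code; RetriableBackoffException and the call are hypothetical stand-ins for the server's well-defined "please retry" exception and the RPC invocation.

```java
import java.util.concurrent.Callable;

// Minimal sketch of client-side exponential back-off, assuming the server
// throws a well-defined "please retry" exception when overloaded.
// RetriableBackoffException and the call are hypothetical stand-ins, not
// the actual Hadoop RetryInvocationHandler API.
public class ExponentialBackoffSketch {
    public static class RetriableBackoffException extends Exception {}

    // Sleep time doubles with each attempt, capped at maxSleepMs.
    public static long sleepTimeMs(int attempt, long baseMs, long maxSleepMs) {
        long t = baseMs * (1L << Math.min(attempt, 30));
        return Math.min(t, maxSleepMs);
    }

    public static <T> T callWithBackoff(Callable<T> call, int maxAttempts,
                                        long baseMs, long maxSleepMs) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (RetriableBackoffException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e;                       // give up after maxAttempts
                }
                Thread.sleep(sleepTimeMs(attempt, baseMs, maxSleepMs));
            }
        }
    }
}
```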
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495770#comment-14495770 ] Hadoop QA commented on HADOOP-10597: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725491/HADOOP-10597-6.patch against trunk revision fddd552. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6104//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6104//console This message is automatically generated.
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494917#comment-14494917 ] Steve Loughran commented on HADOOP-10597: - If Arpit is happy, I'm happy. The one thing I would change is in the test, where {{assertTrue("RetriableException not received", succeeded);}} is in the finally clause, followed by the cleanup actions. That assert should appear outside the finally block, which should be kept for the cleanup on all exit points, normal and exceptional.
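The structure Steve suggests looks roughly like the following. This is plain Java illustrating the pattern, not the actual TestRPC code; doCall(), cleanup() and the flag are placeholders.

```java
// Sketch of the test structure suggested above: the finally block only
// cleans up, and the assertion moves after it so cleanup runs on every
// exit path. doCall()/cleanup() are placeholders, not the real TestRPC code.
public class AssertOutsideFinally {
    static boolean cleanedUp = false;

    static boolean doCall() { return true; }          // pretend RPC call
    static void cleanup()   { cleanedUp = true; }     // must run on all exits

    public static boolean run() {
        boolean succeeded = false;
        try {
            succeeded = doCall();
        } finally {
            cleanup();   // cleanup only; no assertions in here
        }
        // The assertion happens outside the finally block:
        if (!succeeded) {
            throw new AssertionError("RetriableException not received");
        }
        return cleanedUp;
    }
}
```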
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493554#comment-14493554 ] Arpit Agarwal commented on HADOOP-10597: The code change looks good to me. I am reviewing the test case. I won't commit it until at least next week in case Steve or [~chrilisf] have comments.
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492555#comment-14492555 ] Hadoop QA commented on HADOOP-10597: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724949/HADOOP-10597-5.patch against trunk revision 174d8b3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6099//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6099//console This message is automatically generated.
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358142#comment-14358142 ] Ming Ma commented on HADOOP-10597: -- bq. Did you use the same trigger on the server in production (RPC queue being full)? Yes, that is what we use. bq. What kind of back-off policy are you using on the client? This patch doesn't add any new policy on the client side. It tries to use the policy passed from the server if one is specified. But given we don't plan to support a server-side policy, the new patch doesn't need to change anything on the client side. The client will receive RetriableException and retry accordingly. Regarding the client-side retry policy we use, we don't configure anything specifically; we use the default, with no configuration for {{DFS_CLIENT_RETRY_POLICY_SPEC_KEY}}. Thus we end up with {{FailoverOnNetworkExceptionRetry}}, which uses exponential backoff. The actual parameters used in the backoff are based on {{DFS_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY}}, {{DFS_CLIENT_FAILOVER_SLEEPTIME_BASE_KEY}} and {{DFS_CLIENT_FAILOVER_SLEEPTIME_MAX_KEY}}.
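The three constants Ming names correspond to client configuration keys along these lines; this is an illustrative hdfs-site.xml fragment, and the values shown are examples rather than necessarily the shipped defaults.

```xml
<!-- Illustrative fragment for the failover/back-off keys named above;
     values are examples, not necessarily the shipped defaults. -->
<configuration>
  <property>
    <name>dfs.client.failover.max.attempts</name>
    <value>15</value>
  </property>
  <property>
    <name>dfs.client.failover.sleep.base.millis</name>
    <value>500</value>
  </property>
  <property>
    <name>dfs.client.failover.sleep.max.millis</name>
    <value>15000</value>
  </property>
</configuration>
```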
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349548#comment-14349548 ] Arpit Agarwal commented on HADOOP-10597: Hi [~mingma], thanks for proposing this work and posting the patch. I still prefer that we don't support multiple retry policies. Exponential back-off is known to be the only mechanism that converges to fairness in a loaded system. If Steve does not object then the v1 implementation can avoid sending retry policies. It should significantly simplify the patch and sidestep the issue of how to communicate the policy.
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349649#comment-14349649 ] Ming Ma commented on HADOOP-10597: -- Thanks, [~arpitagarwal]. We have been using the no-server-retry-policy option in production clusters for some time and things are fine. So I agree with you; at least we don't have to add it in the initial version. If later we have scenarios that require the server to supply retry policies, we can add it at that point.
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349693#comment-14349693 ] Arpit Agarwal commented on HADOOP-10597: bq. We have been using no server retry policy option in production clusters for some time and things are fine. That's a good datapoint, thanks. Couple of questions: # What kind of back-off policy are you using on the client? Do you have a separate Jira for the client-side work? I just realized today that we already have configurable retry policies via {{DFS_CLIENT_RETRY_POLICY_SPEC_KEY}}. We also have an {{ExponentialBackoffRetry}}, but it doesn't look widely used. Did you use either of these? # Did you use the same trigger on the server in production (RPC queue being full)?
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269374#comment-14269374 ] Steve Loughran commented on HADOOP-10597: - I think this is a topic for the RPC people, not me. I just would prefer to see a new field, not stuff hidden into the text payload. That way lies WS-*
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267318#comment-14267318 ] Hadoop QA commented on HADOOP-10597: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690499/HADOOP-10597-4.patch against trunk revision 788ee35. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1215 javac compiler warnings (more than the trunk's current 1214 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5373//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/5373//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5373//console This message is automatically generated. 
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255233#comment-14255233 ] Steve Loughran commented on HADOOP-10597: - Looks OK to me...
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255234#comment-14255234 ] Steve Loughran commented on HADOOP-10597: - Sorry, browser submitted too early. Looks OK to me, though it's gone deep enough into the RPC stack that I'm out of my depth. Minor recommendations: * tag things as audience=private as well as unstable * {{LinearClientBackoffPolicy}} p 46-49: can we pull these inline strings out as public constants? It keeps errors down in tests and other code setting things. * {{NullClientBackoffPolicy}} should just extend {{Configured}} to remove the boilerplate set/get conf logic * TestRPC.testClientBackOff(): recommend saving any caught IOE and, if !succeeded, rethrowing it. It'll help debugging failing tests. One thing I will highlight is that I'm not that enamoured of how the retriable exception protobuf data is being marshalled into the string value of the exception. Why choose this approach?
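Steve's last test recommendation, saving a caught IOE and rethrowing it on failure, can be sketched like this; attemptCall() and the flag are placeholders, not the actual TestRPC.testClientBackOff() code.

```java
import java.io.IOException;

// Sketch of the suggestion above: remember any caught IOException and,
// if the test did not see the expected outcome, rethrow it so the test
// failure carries the real cause. attemptCall() is a placeholder.
public class RethrowSavedException {
    static boolean attemptCall() throws IOException {
        throw new IOException("server busy");      // simulated failure
    }

    public static boolean run() throws IOException {
        boolean succeeded = false;
        IOException saved = null;
        try {
            succeeded = attemptCall();
        } catch (IOException ioe) {
            saved = ioe;                           // keep it for diagnostics
        }
        if (!succeeded && saved != null) {
            throw saved;                           // failing test shows the cause
        }
        return succeeded;
    }
}
```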
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253920#comment-14253920 ] Hadoop QA commented on HADOOP-10597: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688388/HADOOP-10597-3.patch against trunk revision 6635ccd. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1217 javac compiler warnings (more than the trunk's current 1216 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 4 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5318//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/5318//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/5318//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5318//console This message is automatically generated. 
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218307#comment-14218307 ] Chris Li commented on HADOOP-10597: --- Ah okay, thanks for clarifying, makes sense now
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208955#comment-14208955 ] Ming Ma commented on HADOOP-10597: -- Thanks, Chris. The backoff retry policy is defined by the interface {{ClientBackoffPolicy}}. There are two implementations of the interface, {{NullClientBackoffPolicy}} and {{LinearClientBackoffPolicy}}. The experiment results are based on {{NullClientBackoffPolicy}}, which doesn't specify any retry policy. Thus the RPC server will return an empty {{RetriableException}} and let the client decide the retry policy. We can start with this policy when we enable this feature in production. That will provide us useful info and help us improve the feature and make necessary modifications to {{ClientBackoffPolicy}} and its implementations in later iterations. {{LinearClientBackoffPolicy}} specifies a retry policy based on the numbers of succeeded and denied requests. The policy will then be returned to the client, and the client is expected to honor it. {{recentBackOffCount}} will decrease with each successfully queued request. So in your case, if a client is denied first and then terminates before it retries, as long as enough requests from other clients are queued successfully, {{recentBackOffCount}} will become zero. There shouldn't be a case where the element is queued correctly but the client is asked to retry. The warn message is there to catch bad implementations of {{ClientBackoffPolicy}}. We can remove that as it doesn't seem to be necessary. Yes, it is better to rename oldValue to something else. I will provide an updated patch after rebase to address your comments.
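The bookkeeping Ming describes for {{recentBackOffCount}} can be rendered as a toy counter: denials bump it, successfully queued requests (from any client) drain it toward zero. This is a guess at the shape of the behaviour described in the comment, not the patch's actual LinearClientBackoffPolicy code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy rendering of the back-off bookkeeping described above: denials
// increment recentBackOffCount, successfully queued requests decrement it,
// and back-off stops once it drains to zero. This is a sketch of the
// described behaviour, not the actual LinearClientBackoffPolicy code.
public class LinearBackoffSketch {
    private final AtomicInteger recentBackOffCount = new AtomicInteger(0);

    // Called when the server denies a request and asks the client to retry.
    public void onDenied() {
        recentBackOffCount.incrementAndGet();
    }

    // Called when a request is queued successfully (any client's request).
    public void onQueued() {
        // Decrease toward zero, never below it.
        recentBackOffCount.updateAndGet(v -> Math.max(0, v - 1));
    }

    public boolean shouldBackOff() {
        return recentBackOffCount.get() > 0;
    }
}
```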
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201000#comment-14201000 ] Chris Li commented on HADOOP-10597: --- Cool. I think this is a good feature to have. One small question about the code: {{LOG.warn("Element " + e + " was queued properly. " + "But client is asked to retry.");}} From my brief study of the code, it seems like isCallQueued is passed pretty deep in order to maintain some sort of reference count on how many pending requests each handler has waiting client-side to retry. Does this count always balance to zero? What if a client makes a request, is denied, and then terminates before it can make a request that successfully queues? Also, under what conditions will the element be queued correctly but the client get a retry? Also, kind of a small thing, but instead of recentBackOffCount.set(oldValue) it would be clearer to create a new variable newValue and call recentBackOffCount.set(newValue) instead of mutating oldValue, or perhaps just rename the oldValue variable to something which doesn't imply immutability.
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171966#comment-14171966 ] Ming Ma commented on HADOOP-10597: -- Thanks, Chris. 1. The first two experiments belong to the "user A is doing some bad things; measure user B's read latency" scenario. The rest of the experiments are done under a single user to measure the performance implication under different loads. 2. We can use client backoff without FCQ, but it is less interesting, given it could penalize good clients. That is because in the current implementation, the criterion the RPC server uses to decide if it needs to ask the client to back off is whether the RPC call queue is full. We can improve this criterion later if this isn't enough. 3. The experiment results are based on the client-driven retry interval policy. It means the server only asks the client to back off; the RPC client will decide the retry policy. In an NN HA setup, that will be FailoverOnNetworkExceptionRetry, which does exponential back off.
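The server-side trigger Ming describes (ask clients to back off when the RPC call queue is full) can be sketched with a non-blocking offer: instead of blocking the reader, a full queue produces a retriable exception. The exception class here is a stand-in for Hadoop's RetriableException, and the queue of strings is a placeholder for the real call objects.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the server-side trigger described above: a non-blocking
// offer() fails when the call queue is full, and the server answers with
// a retriable "back off" exception instead of blocking. RetriableException
// here is a stand-in for the Hadoop class of the same name.
public class BackoffOnFullQueue {
    public static class RetriableException extends Exception {
        public RetriableException(String msg) { super(msg); }
    }

    private final BlockingQueue<String> callQueue;

    public BackoffOnFullQueue(int capacity) {
        this.callQueue = new ArrayBlockingQueue<>(capacity);
    }

    public void enqueue(String call) throws RetriableException {
        if (!callQueue.offer(call)) {          // queue full: don't block
            throw new RetriableException("Server busy, please back off");
        }
    }

    public int size() { return callQueue.size(); }
}
```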
[jira] [Commented] (HADOOP-10597) Evaluate if we can have RPC client back off when server is under heavy load
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169689#comment-14169689 ] Chris Li commented on HADOOP-10597: --- Hi [~mingma], thanks for adding some numbers. If I understand correctly from the graph, the latency spike is a result of maxing out the call queue's capacity, which FairCallQueue will not solve since FCQ has no choice but to enqueue a call somewhere. Just to double check, were all these calls made under the same user? I'd guess that RPC client backoff would work just as well when FairCallQueue is disabled, since it solves the different problem of alleviating a full queue. I do agree with Steve that we'll want some fuzz on the retry method, since a linear retry interval could cause load to become periodic over time.
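The "fuzz" being asked for is commonly implemented as full jitter over a capped exponential delay. A small sketch, with illustrative constants (the patches under discussion do not define these values):

```java
import java.util.concurrent.ThreadLocalRandom;

// Full-jitter exponential backoff: retries from many clients spread out
// randomly instead of synchronizing into periodic load spikes.
public class RetryWithJitter {
    static final long BASE_MILLIS = 500;    // illustrative base delay
    static final long MAX_MILLIS = 30_000;  // illustrative cap

    // Sleep a uniformly random time in [0, min(cap, base * 2^attempt)].
    public static long backoffMillis(int attempt) {
        long exp = Math.min(MAX_MILLIS, BASE_MILLIS << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(exp + 1);
    }
}
```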
[jira] [Commented] (HADOOP-10597) Evaluate if we can have RPC client back off when server is under heavy load
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144401#comment-14144401 ] Hadoop QA commented on HADOOP-10597: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670619/MoreRPCClientBackoffEvaluation.pdf against trunk revision a9a55db. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4789//console This message is automatically generated.
[jira] [Commented] (HADOOP-10597) Evaluate if we can have RPC client back off when server is under heavy load
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081779#comment-14081779 ] Arpit Agarwal commented on HADOOP-10597: Hi [~mingma], just looked at the attached doc. If I understand correctly, the server tells the client which backoff policy to use. The backoff policy could just be a client-side configuration, since the server's suggestion is only advisory, unless the server has a way to penalize clients that fail to follow the suggestion. Also, you have probably seen the RPC Congestion Control work under HADOOP-9460. Is there any overlap?
[jira] [Commented] (HADOOP-10597) Evaluate if we can have RPC client back off when server is under heavy load
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081950#comment-14081950 ] Ming Ma commented on HADOOP-10597: -- Thanks, Jing and Arpit. 1. In the current implementation, the RPC server only throws RetriableException back to the client when the RPC queue is full, or more specifically, when the RPC queue is full for the RPC user with HADOOP-9460. So before the RPC queue is full, there should be no difference. It might be interesting to verify the scenario with a large number of connections: the blocking approach could hold up lots of TCP connections, so other users' requests can't connect. 2. The value of a server-defined backoff policy. So far I don't have any use case that requires the server to specify the backoff policy, so it is possible all we need is to have the RPC server throw RetriableException without a backoff policy. I put it there for extensibility, based on Steve's suggestion; it might still be useful later. What if the client doesn't honor the policy? In a controlled environment, we can assume a single client will use the hadoop RPC client, which enforces the policy; if we have many clients, then the backoff policy component in the RPC server, such as LinearClientBackoffPolicy, can keep state and adjust the backoff policy parameters. 3. How this is related to HADOOP-9640. HADOOP-9640 is quite useful; client backoff can be complementary to it. FairCallQueue currently is blocking: if a given RPC request's enqueue into FairCallQueue is blocked due to the FairCallQueue policy, it will hold up the TCP connection and the reader threads. If we use FairCallQueue together with client backoff, requests from a heavily loaded application won't hold up the TCP connection and the reader threads, allowing other applications' requests to be processed more quickly. Some evaluation comparing HADOOP-9640 alone with HADOOP-9640 + client backoff might be useful. I will follow up with Chris Li on that. Are there any other scenarios?
For example, we could have RPC reject requests based on user id, method name, or machine IP in some operational situations. Granted, these can also be handled at a higher layer.
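The client side of the contract described above (understand the retriable signal and back off exponentially before retrying) can be sketched as a generic retry loop. The exception type and helper below are stand-ins for illustration, not Hadoop's actual RetryInvocationHandler API:

```java
import java.util.concurrent.Callable;

// Illustrative client loop: on a retriable "server busy" signal, sleep
// with capped exponential backoff and retry, rather than holding a
// blocked TCP connection open on the server side.
public class BackoffClientSketch {
    public static class ServerBusyException extends Exception {}

    public static <T> T invokeWithBackoff(Callable<T> call, int maxAttempts)
            throws Exception {
        long sleepMillis = 50;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (ServerBusyException e) {
                if (attempt >= maxAttempts) {
                    throw e;  // give up after maxAttempts tries
                }
                Thread.sleep(sleepMillis);
                sleepMillis = Math.min(sleepMillis * 2, 10_000); // exponential, capped
            }
        }
    }
}
```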
[jira] [Commented] (HADOOP-10597) Evaluate if we can have RPC client back off when server is under heavy load
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077154#comment-14077154 ] Jing Zhao commented on HADOOP-10597: This can be a very useful feature. For the evaluation, maybe it will be helpful to see how the change affects request latency and NN throughput as the workload of the NN increases from low to high. For example, one of my questions is: when the NN workload is just relatively high (some but not many requests are blocking on the queue), will the average latency and throughput be affected if the NN directly lets the client back off instead of blocking and timing out?
[jira] [Commented] (HADOOP-10597) Evaluate if we can have RPC client back off when server is under heavy load
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14052236#comment-14052236 ] Hadoop QA commented on HADOOP-10597: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654055/HADOOP-10597-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1261 javac compiler warnings (more than the trunk's current 1260 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common: org.apache.hadoop.ipc.TestRPC {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4207//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/4207//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/4207//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4207//console This message is automatically generated. 
[jira] [Commented] (HADOOP-10597) Evaluate if we can have RPC client back off when server is under heavy load
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998689#comment-13998689 ] Steve Loughran commented on HADOOP-10597: - this could be useful for clients of other services too, where the back-off message could trigger a redirect. Maybe the response could include some hints of:
# where else to go
# backoff parameter hints: sleep time, growth, jitter
This gives the NN more control of the clients, lets you spread the jitter, and lets you grow the backoff time as load increases, so reducing socket connection load.
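The hint idea above, a payload naming where else to go plus sleep/growth/jitter parameters, could look something like the following. All field and method names here are hypothetical; nothing like this exists in the attached patches:

```java
// Hypothetical server-advertised backoff hint: a redirect target plus
// the parameters of the backoff schedule the client should apply.
public class BackoffHint {
    final String redirectTo;        // alternative endpoint, may be null
    final long initialSleepMillis;  // first retry delay
    final double growthFactor;      // multiplier per retry
    final double jitterFraction;    // e.g. 0.2 => +/-20% randomization

    public BackoffHint(String redirectTo, long initialSleepMillis,
                       double growthFactor, double jitterFraction) {
        this.redirectTo = redirectTo;
        this.initialSleepMillis = initialSleepMillis;
        this.growthFactor = growthFactor;
        this.jitterFraction = jitterFraction;
    }

    // rand01 is a uniform sample in [0, 1); passed in to keep this testable.
    public long sleepForAttempt(int attempt, double rand01) {
        double base = initialSleepMillis * Math.pow(growthFactor, attempt);
        double jitter = 1.0 + jitterFraction * (2 * rand01 - 1);
        return (long) (base * jitter);
    }
}
```

Raising growthFactor or initialSleepMillis in the hints as load rises would let the NN grow the backoff time cluster-wide, as the comment suggests.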