[
https://issues.apache.org/jira/browse/HADOOP-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168799#comment-15168799
]
Junping Du commented on HADOOP-12622:
-------------------------------------
Thanks for review and comments, Jian. Upload v6 patch to incorporate this
comment.
> RetryPolicies (other than FailoverOnNetworkExceptionRetry) should put on
> retry failed reason or the log from RMProxy's retry could be very misleading.
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-12622
> URL: https://issues.apache.org/jira/browse/HADOOP-12622
> Project: Hadoop Common
> Issue Type: Bug
> Components: auto-failover
> Affects Versions: 2.6.0, 2.7.0
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: HADOOP-12622-v2.patch, HADOOP-12622-v3.1.patch,
> HADOOP-12622-v3.patch, HADOOP-12622-v4.patch, HADOOP-12622-v5.patch,
> HADOOP-12622-v6.patch, HADOOP-12622.patch
>
>
> In debugging a NM retry connection to RM (non-HA), the NM log during RM down
> time is very misleading:
> {noformat}
> 2015-12-07 11:37:14,098 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:15,099 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:16,101 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:17,103 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:18,105 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:19,107 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:20,109 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:21,112 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 7 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:22,113 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:23,115 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:54,120 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:55,121 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:56,123 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:57,125 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:58,126 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:37:59,128 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2015-12-07 11:38:00,130 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> {noformat}
> It actually only log client side retry on NetworkConnection failure but not
> include any info on RetryInvocationHandler where the real retry policy works.
> From the code below in RetryInvocationHandler.java, even the retry ends, we
> don't put warn messages to include how much/many time/ counts we spent on
> retry logic that make it harder to debug.
> {code}
> if (failAction != null) {
> if (failAction.reason != null) {
> LOG.warn("Exception while invoking " +
> currentProxy.proxy.getClass()
> + "." + method.getName() + " over " + currentProxy.proxyInfo
> + ". Not retrying because " + failAction.reason, ex);
> }
> throw ex;
> }
> {code}
> We should add failAction.reason as much as we can in multiple retry policies.
> In addition, we should keep consistent in log level for message during the
> retry attempts: now the ipc.client is INFO, but RetryInvocationHandler is
> DEBUG (if not fail_over). We should keep them consistent or it could be very
> confusing.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)