[ 
https://issues.apache.org/jira/browse/HADOOP-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089704#comment-15089704
 ] 

Steve Loughran commented on HADOOP-12622:
-----------------------------------------

Looks OK. If there's one thing I don't like, it's that you have strings in the 
policies that you have copied and pasted into the tests for validation. I'd 
prefer you used constant strings in the production code and used the same 
strings in the tests. Why? Stops the tests failing if you ever change the 
message.

> RetryPolicies (other than FailoverOnNetworkExceptionRetry) should put on 
> retry failed reason or the log from RMProxy's retry could be very misleading.
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-12622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12622
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.6.0, 2.7.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: HADOOP-12622-v2.patch, HADOOP-12622-v3.1.patch, 
> HADOOP-12622-v3.patch, HADOOP-12622.patch
>
>
> In debugging a NM retry connection to RM (non-HA), the NM log during RM down 
> time is very misleading:
> {noformat}
> 2015-12-07 11:37:14,098 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:15,099 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:16,101 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:17,103 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:18,105 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:19,107 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:20,109 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:21,112 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 7 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:22,113 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:23,115 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:54,120 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:55,121 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:56,123 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:57,125 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:58,126 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:59,128 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:38:00,130 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> {noformat}
> It actually only log client side retry on NetworkConnection failure but not 
> include any info on RetryInvocationHandler where the real retry policy works. 
> From the code below in RetryInvocationHandler.java, even the retry ends, we 
> don't put warn messages to include how much/many time/ counts we spent on 
> retry logic that make it harder to debug.
> {code}
>         if (failAction != null) {
>           if (failAction.reason != null) {
>             LOG.warn("Exception while invoking " + 
> currentProxy.proxy.getClass()
>                 + "." + method.getName() + " over " + currentProxy.proxyInfo
>                 + ". Not retrying because " + failAction.reason, ex);
>           }
>           throw ex;
>         }
> {code}
> We should add failAction.reason as much as we can in multiple retry policies. 
> In addition, we should keep consistent in log level for message during the 
> retry attempts: now the ipc.client is INFO, but RetryInvocationHandler is 
> DEBUG (if not fail_over). We should keep them consistent or it could be very 
> confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to