[ 
https://issues.apache.org/jira/browse/HADOOP-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HADOOP-12622:
--------------------------------
    Status: Patch Available  (was: Open)

> RetryPolicies (other than FailoverOnNetworkExceptionRetry) should put on 
> retry failed reason or the log from RMProxy's retry could be very misleading.
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-12622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12622
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.7.0, 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: HADOOP-12622.patch
>
>
> In debugging a NM retry connection to RM (non-HA), the NM log during RM down 
> time is very misleading:
> {noformat}
> 2015-12-07 11:37:14,098 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:15,099 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:16,101 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:17,103 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:18,105 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:19,107 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:20,109 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:21,112 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 7 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:22,113 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:23,115 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:54,120 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:55,121 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:56,123 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:57,125 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:58,126 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:37:59,128 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-12-07 11:38:00,130 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> {noformat}
> It actually only log client side retry on NetworkConnection failure but not 
> include any info on RetryInvocationHandler where the real retry policy works. 
> From the code below in RetryInvocationHandler.java, even the retry ends, we 
> don't put warn messages to include how much/many time/ counts we spent on 
> retry logic that make it harder to debug.
> {code}
>         if (failAction != null) {
>           if (failAction.reason != null) {
>             LOG.warn("Exception while invoking " + 
> currentProxy.proxy.getClass()
>                 + "." + method.getName() + " over " + currentProxy.proxyInfo
>                 + ". Not retrying because " + failAction.reason, ex);
>           }
>           throw ex;
>         }
> {code}
> We should add failAction.reason as much as we can in multiple retry policies. 
> In addition, we should keep consistent in log level for message during the 
> retry attempts: now the ipc.client is INFO, but RetryInvocationHandler is 
> DEBUG (if not fail_over). We should keep them consistent or it could be very 
> confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to