[ 
https://issues.apache.org/jira/browse/HBASE-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607706#comment-16607706
 ] 

Mingliang Liu commented on HBASE-21164:
---------------------------------------

Thanks [~stack]. I revised the test slightly, relaxing the assertion to 
tolerate thread contention. I also made sure the log capturer actually captures 
the failure message logs. In the v3 patch, the assertion is now basically as follows:
{code}
    int count = StringUtils.countMatches(output, failMsg);

    // Following asserts the actual retry number is in range (expectedRetry/2, expectedRetry*2).
    // Ideally we can assert the exact retry count. We relax here to tolerate contention error.
    int expectedRetry = (int) Math.ceil(Math.log(interval - 100));
    assertTrue(String.format("reportForDuty retries %d times, less than expected min %d",
        count, expectedRetry / 2), count > expectedRetry / 2);
    assertTrue(String.format("reportForDuty retries %d times, more than expected max %d",
        count, expectedRetry * 2), count < expectedRetry * 2);
{code}
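
To make the tolerated range concrete, here is a small standalone sketch of the same arithmetic. The class name RetryBounds, its method names, and the 10000 ms sample interval are hypothetical and not part of the patch; only the formula and the (expectedRetry/2, expectedRetry*2) range are taken from the snippet above. Note that Math.log is the natural logarithm.

```java
/** Hypothetical helper mirroring the relaxed assertion; not part of the patch. */
class RetryBounds {

  /** Same formula as the test snippet: ceil(ln(interval - 100)). */
  static int expectedRetry(long intervalMs) {
    return (int) Math.ceil(Math.log(intervalMs - 100));
  }

  /** True when count lies strictly inside (expectedRetry/2, expectedRetry*2). */
  static boolean withinTolerance(int count, long intervalMs) {
    int expected = expectedRetry(intervalMs);
    return count > expected / 2 && count < expected * 2;
  }

  public static void main(String[] args) {
    long intervalMs = 10_000L; // assumed sample value, for illustration only
    int expected = expectedRetry(intervalMs);
    System.out.printf("expected=%d, accepted range=(%d, %d)%n",
        expected, expected / 2, expected * 2);
  }
}
```

Both bounds are exclusive, so a count equal to expectedRetry/2 or expectedRetry*2 fails the check.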

I think it makes sense to do the same for the subsequent heartbeats. However, I 
found that {{tryRegionServerReport()}} does not return a result and does not log 
a message, and testing it would need a separate class, so I think a new issue is 
the right place for that. Shall we move the test utility class LogCapture out of 
the current test class so that it can be used elsewhere?
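
For discussion, a standalone capturer could look roughly like the sketch below. This is not the LogCapture class from the patch: the name LogCaptureSketch and its methods are hypothetical, and it uses java.util.logging only to stay self-contained, whereas the real utility would attach to whatever logging backend the HBase tests use. The local countMatches stands in for StringUtils.countMatches.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

/** Hypothetical reusable log capturer (assumption: java.util.logging backend). */
class LogCaptureSketch implements AutoCloseable {
  private final Logger logger;
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final StreamHandler handler;

  LogCaptureSketch(Logger logger) {
    this.logger = logger;
    // Route a copy of every record published by this logger into an in-memory buffer.
    this.handler = new StreamHandler(buffer, new SimpleFormatter());
    this.handler.setLevel(Level.ALL);
    logger.addHandler(handler);
  }

  /** Returns everything captured so far. */
  String getOutput() {
    handler.flush();
    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
  }

  /** Counts non-overlapping occurrences of needle in text. */
  static int countMatches(String text, String needle) {
    if (needle.isEmpty()) {
      return 0;
    }
    int count = 0;
    for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + needle.length())) {
      count++;
    }
    return count;
  }

  /** Detaches the handler so later tests are not affected. */
  @Override
  public void close() {
    logger.removeHandler(handler);
    handler.close();
  }
}
```

A test would wrap the code under test in a try-with-resources block, then pass getOutput() and the failure message to countMatches.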

> reportForDuty should do (exponential) backoff rather than retry every 3 
> seconds (default).
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21164
>                 URL: https://issues.apache.org/jira/browse/HBASE-21164
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: stack
>            Assignee: Mingliang Liu
>            Priority: Minor
>         Attachments: HBASE-21164.branch-2.1.001.patch, 
> HBASE-21164.branch-2.1.002.patch, HBASE-21164.branch-2.1.003.patch
>
>
> RegionServers call reportForDuty on startup to tell the Master they are 
> available. If the Master is still initializing, which can take a while on a 
> big cluster, particularly if something is amiss, the log message every three 
> seconds is annoying and serves no purpose. On failure, back off up to a 
> reasonable maximum period. Here is an example:
> {code}
> 2018-09-06 14:01:39,312 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=vc0207.halxg.cloudera.com,22001,1536266763109 with port=22001, startcode=1536266763109
> 2018-09-06 14:01:39,312 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
> ....
> {code}
> For example, I am looking at a large cluster now that had a backlog of 
> procedure WALs. It is taking a couple of hours to recreate the procedure state 
> because there are millions of procedures outstanding. Meanwhile, the Master 
> log is just full of the above message -- every three seconds...
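
The backoff requested in the description quoted above can be sketched as a capped doubling schedule. This is only an illustration under stated assumptions, not the implementation in the attached patches: the class BackoffSketch, the doubling factor, and the 5-minute cap are hypothetical. The 3-second starting interval and the two-hour window come from the description.

```java
/** Hypothetical capped exponential backoff; the factor and cap are assumptions. */
class BackoffSketch {

  /** Sleep before retry number attempt (0-based): initialMs doubled per attempt, capped at maxMs. */
  static long sleepMs(long initialMs, long maxMs, int attempt) {
    if (initialMs <= 0 || maxMs <= 0) {
      throw new IllegalArgumentException("intervals must be positive");
    }
    long sleep = initialMs;
    // Stop doubling once the cap is reached.
    for (int i = 0; i < attempt && sleep < maxMs; i++) {
      sleep *= 2;
    }
    return Math.min(sleep, maxMs);
  }

  /** Number of retries whose cumulative sleep time fits inside windowMs. */
  static int retriesWithin(long initialMs, long maxMs, long windowMs) {
    int attempts = 0;
    long elapsed = 0;
    while (elapsed + sleepMs(initialMs, maxMs, attempts) <= windowMs) {
      elapsed += sleepMs(initialMs, maxMs, attempts);
      attempts++;
    }
    return attempts;
  }

  public static void main(String[] args) {
    long initialMs = 3_000L;                // 3 seconds, from the description
    long maxMs = 5 * 60 * 1_000L;           // assumed cap of 5 minutes
    long windowMs = 2 * 60 * 60 * 1_000L;   // "a couple of hours"
    System.out.println("fixed interval retries: " + (windowMs / initialMs));
    System.out.println("backoff retries: " + retriesWithin(initialMs, maxMs, windowMs));
  }
}
```

With these assumed values, the number of retries, and therefore of log lines, over two hours drops from thousands to a few dozen.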



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
