Nick Dimiduk created HBASE-12403:
------------------------------------

             Summary: IntegrationTestMTTR flaky due to aggressive RS restart 
timeout
                 Key: HBASE-12403
                 URL: https://issues.apache.org/jira/browse/HBASE-12403
             Project: HBase
          Issue Type: Test
          Components: integration tests
            Reporter: Nick Dimiduk
            Priority: Minor


TL;DR: the CM RestartRS action timeout is only 60 seconds. Considering the RS 
must connect to the Master before it can be online, this is not long enough 
time in an environment where the Master can also be killed.

Failure from the console says the test failed because a 
RestartRsHoldingMetaAction timed out.

{noformat}
Caused by: java.io.IOException: did timeout waiting for region server to 
start:ip-172-31-42-248.ec2.internal
at 
org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:153)
at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:93)
at 
org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.restartRs(RestartActionBaseAction.java:52)
at 
org.apache.hadoop.hbase.chaos.actions.RestartRsHoldingMetaAction.perform(RestartRsHoldingMetaAction.java:38)
at 
org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:559)
at 
org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:550)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

This is only reported at the end of the test run. There's no indication as to 
when during the test run this failure happened. The timeout on the start RS 
operation is 60 seconds.

Hacking out the start/stop messages from the logs during the time window when 
this test ran, it appears that at one point the RS took 2min 12s between when 
it was launched and when it reported for duty

{noformat}
Fri Oct 31 14:53:17 UTC 2014 Starting regionserver on ip-172-31-42-248
2014-10-31 14:55:29,049 INFO  [regionserver60020] regionserver.HRegionServer: 
Serving as ip-172-31-42-248.ec2.internal,60020,1414767238992, RpcServer on 
ip-172-31-42-248.ec2.internal/172.31.42.248:60020, sessionid=0x249661c2b7b0118
{noformat}

The RS came up without incident. It spent 1min 4s of that time waiting on the 
master to start, attempted to report for duty from 14:54:28 to 14:55:24.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to