[
https://issues.apache.org/jira/browse/HBASE-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193450#comment-14193450
]
Hudson commented on HBASE-12403:
--------------------------------
FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #615 (See
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/615/])
HBASE-12403 IntegrationTestMTTR flaky due to aggressive RS restart timeout
(ndimiduk: rev 414bed7197097db4e2ce638f46d9996fdfb305b1)
* hbase-it/src/test/java/org/apache/hadoop/hbase/mttr/IntegrationTestMTTR.java
* hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
> IntegrationTestMTTR flaky due to aggressive RS restart timeout
> --------------------------------------------------------------
>
> Key: HBASE-12403
> URL: https://issues.apache.org/jira/browse/HBASE-12403
> Project: HBase
> Issue Type: Test
> Components: integration tests
> Reporter: Nick Dimiduk
> Assignee: Nick Dimiduk
> Priority: Minor
> Fix For: 2.0.0, 0.98.8, 0.99.2
>
> Attachments: HBASE-12403.00.patch
>
>
> TL;DR: the CM RestartRS action timeout is only 60 seconds. Because the RS
> must connect to the Master before it can come online, this is not enough
> time in an environment where the Master can also be killed.
> The console output shows that the test failed because a
> RestartRsHoldingMetaAction timed out.
> {noformat}
> Caused by: java.io.IOException: did timeout waiting for region server to start:ip-172-31-42-248.ec2.internal
> at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:153)
> at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:93)
> at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.restartRs(RestartActionBaseAction.java:52)
> at org.apache.hadoop.hbase.chaos.actions.RestartRsHoldingMetaAction.perform(RestartRsHoldingMetaAction.java:38)
> at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:559)
> at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$ActionCallable.call(IntegrationTestMTTR.java:550)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This is only reported at the end of the test run. There's no indication as to
> when during the test run this failure happened. The timeout on the start RS
> operation is 60 seconds.
> Extracting the start/stop messages from the logs for the time window when
> this test ran shows that at one point the RS took 2min 12s between when
> it was launched and when it reported for duty:
> {noformat}
> Fri Oct 31 14:53:17 UTC 2014 Starting regionserver on ip-172-31-42-248
> 2014-10-31 14:55:29,049 INFO [regionserver60020] regionserver.HRegionServer: Serving as ip-172-31-42-248.ec2.internal,60020,1414767238992, RpcServer on ip-172-31-42-248.ec2.internal/172.31.42.248:60020, sessionid=0x249661c2b7b0118
> {noformat}
> The RS came up without incident. It spent 1min 4s of that time waiting on the
> master to start, then attempted to report for duty from 14:54:28 to 14:55:24.
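
The timing arithmetic in the description above can be checked with a small sketch. It is not part of the patch: the class and method names (RsStartupTiming, startupTime, reportForDutyWindow, exceedsTimeout) are hypothetical. The timestamps are the ones quoted from the logs, truncated to whole seconds, and the 60-second value is the timeout stated in the description.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical helper, for illustration only; not HBase code.
public class RsStartupTiming {
    // Timeout on the start-RS operation, as stated in the issue description.
    static final Duration RESTART_TIMEOUT = Duration.ofSeconds(60);

    // Time between "Starting regionserver" and the "Serving as ..." log line.
    static Duration startupTime() {
        LocalDateTime launched = LocalDateTime.of(2014, 10, 31, 14, 53, 17);
        LocalDateTime serving = LocalDateTime.of(2014, 10, 31, 14, 55, 29);
        return Duration.between(launched, serving);
    }

    // Window during which the RS was attempting to report for duty.
    static Duration reportForDutyWindow() {
        LocalDateTime from = LocalDateTime.of(2014, 10, 31, 14, 54, 28);
        LocalDateTime to = LocalDateTime.of(2014, 10, 31, 14, 55, 24);
        return Duration.between(from, to);
    }

    // True when the observed duration is longer than the restart timeout.
    static boolean exceedsTimeout(Duration observed) {
        return observed.compareTo(RESTART_TIMEOUT) > 0;
    }

    public static void main(String[] args) {
        Duration startup = startupTime();
        System.out.println("startup seconds: " + startup.getSeconds());
        System.out.println("report-for-duty seconds: " + reportForDutyWindow().getSeconds());
        System.out.println("exceeds 60s timeout: " + exceedsTimeout(startup));
    }
}
```

Under these assumptions the launch-to-serving interval is 132 seconds (2min 12s), more than twice the 60-second timeout, which is consistent with the reported failure.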
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)