[
https://issues.apache.org/jira/browse/HBASE-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115543#comment-13115543
]
ramkrishna.s.vasudevan commented on HBASE-4492:
-----------------------------------------------
Hi Ted,
I found the problem, but I don't have the patch with me right now.
Please correct me if I am wrong.
As per my analysis, the test case always passes in the one scenario where ROOT
and META are on the same RS.
If we look at the test code:
{code}
// Bring the RS hosting ROOT down and the RS hosting META down at once
RegionServerThread rootServer = getServerHostingRoot(cluster);
RegionServerThread metaServer = getServerHostingMeta(cluster);
if (rootServer == metaServer) {
log("ROOT and META on the same server so killing another random server");
int i=0;
while (rootServer == metaServer) {
metaServer = cluster.getRegionServerThreads().get(i);
i++;
}
}
log("Stopping server hosting ROOT");
rootServer.getRegionServer().stop("Stopping ROOT server");
log("Stopping server hosting META #1");
metaServer.getRegionServer().stop("Stopping META server");
{code}
we try to be cautious when ROOT and META are on the same RS. If we find ROOT
and META on the same RS, we just point metaServer at some other RS (we don't
check whether it is actually the one hosting META) but still call
rootServer.stop(). Stopping the rootServer internally closes both the ROOT and
META regions it hosts.
Hence the line
{code}
cluster.hbaseCluster.waitOnRegionServer(metaServer);
{code}
also returns, since the substituted metaServer was stopped as well.
Now, when the ServerShutdownHandler processes this, it can cleanly open ROOT
and META on new servers.
In the failure cases, the problem is that ROOT and META are assigned to
different servers. There is a time gap between the ROOT RS going down and the
META RS going down, and in that window something happens: the
ServerShutdownHandler invoked for the ROOT RS tries to assign ROOT while, at
the same time, someone else also tries to assign the ROOT node (who that
someone is, I am not yet clear :)).
Please do correct me if I am wrong. I will try to provide a patch so that this
issue does not come up again; a rough sketch of one possible direction is below.
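To make the co-located case concrete, here is a minimal sketch (not the committed
patch) of one possible direction: stop the server hosting both catalog regions
first, wait for META to be reassigned to a live server, and only then stop that
server. Note this serializes the two stops, so it changes the "down at once"
behaviour the test originally aimed for. The 100 ms poll and the assumption that
the enclosing test method declares throws Exception are mine; getServerHostingRoot,
getServerHostingMeta, log and the cluster fields are the ones already used in the
test above.
{code}
// Sketch only: serialize the shutdowns so the test never stops an unrelated RS.
RegionServerThread rootServer = getServerHostingRoot(cluster);
RegionServerThread metaServer = getServerHostingMeta(cluster);

log("Stopping server hosting ROOT");
rootServer.getRegionServer().stop("Stopping ROOT server");
cluster.hbaseCluster.waitOnRegionServer(rootServer);

if (metaServer == rootServer) {
  // META went down together with ROOT; poll until it is hosted by a live server.
  do {
    Thread.sleep(100);
    metaServer = getServerHostingMeta(cluster);
  } while (metaServer == null || metaServer == rootServer);
}

log("Stopping server hosting META #1");
metaServer.getRegionServer().stop("Stopping META server");
cluster.hbaseCluster.waitOnRegionServer(metaServer);
{code}
Whether the real fix belongs in the test or in the ROOT assignment path of the
ServerShutdownHandler is exactly the open question above.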
> TestRollingRestart fails intermittently
> ---------------------------------------
>
> Key: HBASE-4492
> URL: https://issues.apache.org/jira/browse/HBASE-4492
> Project: HBase
> Issue Type: Test
> Reporter: Ted Yu
> Assignee: Jonathan Gray
> Attachments: 4492.txt
>
>
> I got the following when running test suite on TRUNK:
> {code}
> testBasicRollingRestart(org.apache.hadoop.hbase.master.TestRollingRestart)
> Time elapsed: 300.28 sec <<< ERROR!
> java.lang.Exception: test timed out after 300000 milliseconds
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.hadoop.hbase.master.TestRollingRestart.waitForRSShutdownToStartAndFinish(TestRollingRestart.java:313)
> at
> org.apache.hadoop.hbase.master.TestRollingRestart.testBasicRollingRestart(TestRollingRestart.java:210)
> {code}
> I ran TestRollingRestart#testBasicRollingRestart manually afterwards, which
> wiped out the test output file for the failed test.
> A similar failure can be found on Jenkins:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/19/testReport/junit/org.apache.hadoop.hbase.master/TestRollingRestart/testBasicRollingRestart/