[ https://issues.apache.org/jira/browse/HBASE-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115543#comment-13115543 ]

ramkrishna.s.vasudevan commented on HBASE-4492:
-----------------------------------------------

Hi Ted,
I found the problem, but I don't have the patch with me right now.
Please correct me if I am wrong.
As per my analysis, the test case always passes in one scenario: when ROOT 
and META are hosted by the same RS.
Consider this test code:
{code}
    // Bring the RS hosting ROOT down and the RS hosting META down at once
    RegionServerThread rootServer = getServerHostingRoot(cluster);
    RegionServerThread metaServer = getServerHostingMeta(cluster);
    if (rootServer == metaServer) {
      log("ROOT and META on the same server so killing another random server");
      int i=0;
      while (rootServer == metaServer) {
        metaServer = cluster.getRegionServerThreads().get(i);
        i++;
      }
    }
    log("Stopping server hosting ROOT");
    rootServer.getRegionServer().stop("Stopping ROOT server");
    log("Stopping server hosting META #1");
    metaServer.getRegionServer().stop("Stopping META server");
{code}
we try to be cautious when ROOT and META are on the same RS. If we find ROOT 
and META on the same RS, we simply point metaServer at another RS (without 
checking which regions that RS actually hosts) but still call rootServer.stop().
This stops the rootServer, internally closing both the ROOT and META regions 
hosted on it.
Hence the line
{code}
cluster.hbaseCluster.waitOnRegionServer(metaServer);
{code}
also returns promptly, because that (unrelated) server was stopped as well.
When the ServerShutdownHandler processes the dead ROOT server, it can cleanly 
open a new ROOT and META.
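To spell out the colocated sequence using the test's own helpers (this is 
just an illustration of the behaviour described above, not a proposed change):
{code}
// Both catalog regions live on rootServer in this branch, so this one
// stop() closes -ROOT- and .META. together.
rootServer.getRegionServer().stop("Stopping ROOT server");
// metaServer was re-pointed at an unrelated RS, so this kills a random
// third server rather than the META host.
metaServer.getRegionServer().stop("Stopping META server");
// Returns cleanly: the server we are waiting on was stopped just above.
cluster.hbaseCluster.waitOnRegionServer(metaServer);
// The ServerShutdownHandler for rootServer then reassigns ROOT and META.
{code}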

In the failure cases, ROOT and META are assigned to different servers.
There is a time gap between the ROOT RS going down and the META RS going 
down. In this window, the ServerShutdownHandler invoked for the dead ROOT RS 
tries to assign ROOT while, at the same time, someone else also tries to 
assign the ROOT node (who this someone is, I am not yet clear :)).
Please do correct me if I am wrong. I will try to provide a patch so that 
this issue does not recur.
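One possible direction for the patch could be to serialize the two kills so 
that ROOT is back online before the META host goes down, closing the window 
in which two actors race to assign the ROOT node. This is just a sketch; in 
particular, the assumption that getServerHostingRoot() returns null while 
ROOT is unassigned would need verifying:
{code}
log("Stopping server hosting ROOT");
rootServer.getRegionServer().stop("Stopping ROOT server");
cluster.hbaseCluster.waitOnRegionServer(rootServer);
// Assumption: the test helper returns null until ROOT is deployed again.
while (getServerHostingRoot(cluster) == null) {
  Thread.sleep(100);
}
log("Stopping server hosting META");
metaServer.getRegionServer().stop("Stopping META server");
{code}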



> TestRollingRestart fails intermittently
> ---------------------------------------
>
>                 Key: HBASE-4492
>                 URL: https://issues.apache.org/jira/browse/HBASE-4492
>             Project: HBase
>          Issue Type: Test
>            Reporter: Ted Yu
>            Assignee: Jonathan Gray
>         Attachments: 4492.txt
>
>
> I got the following when running the test suite on TRUNK:
> {code}
> testBasicRollingRestart(org.apache.hadoop.hbase.master.TestRollingRestart)  Time elapsed: 300.28 sec  <<< ERROR!
> java.lang.Exception: test timed out after 300000 milliseconds
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.master.TestRollingRestart.waitForRSShutdownToStartAndFinish(TestRollingRestart.java:313)
>         at org.apache.hadoop.hbase.master.TestRollingRestart.testBasicRollingRestart(TestRollingRestart.java:210)
> {code}
> I ran TestRollingRestart#testBasicRollingRestart manually afterwards, which 
> wiped out the test output file for the failed test.
> Similar failure can be found on Jenkins:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/19/testReport/junit/org.apache.hadoop.hbase.master/TestRollingRestart/testBasicRollingRestart/
