[jira] [Commented] (HBASE-19335) Fix waitUntilAllRegionsAssigned

stack (JIRA) Mon, 27 Nov 2017 15:15:17 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267768#comment-16267768
 ]


stack commented on HBASE-19335:
-------------------------------

LGTM. I like the use of the category-based timeout. When do these new lists get 
cleared out, or is that not important in a test context?

> Fix waitUntilAllRegionsAssigned
> -------------------------------
>
>                 Key: HBASE-19335
>                 URL: https://issues.apache.org/jira/browse/HBASE-19335
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Appy
>            Assignee: Appy
>         Attachments: HBASE-19335.master.001.patch, 
> HBASE-19335.master.002.patch
>
>
> Found when debugging flaky test TestRegionObserverInterface#testRecovery.
> In the end, the test does the following:
> - Kills the RS
> - Waits for all regions to be assigned
> - Some validation (unrelated)
> - Cleanup: delete table.
> {noformat}
>       cluster.killRegionServer(rs1.getRegionServer().getServerName());
>       Threads.sleep(1000); // Let the kill soak in.
>       util.waitUntilAllRegionsAssigned(tableName);
>       LOG.info("All regions assigned");
>       verifyMethodResult(SimpleRegionObserver.class,
>         new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs", "getCtPreWALRestore",
>             "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" },
>         tableName, new Integer[] { 1, 1, 2, 2, 0, 0 });
>     } finally {
>       util.deleteTable(tableName);
>       table.close();
>     }
>   }
> {noformat}
> However, the test logs showed Assigns overlapping with Unassigns. As a 
> result, regions ended up 'stuck in RIT' and the test timed out.
> The Assigns came from the ServerCrashRecovery and the Unassigns came from the 
> deleteTable cleanup.
> Which raises the question: why did HBTU.waitUntilAllRegionsAssigned(tableName) 
> not wait until recovery was complete?
> Answer: It looks like that function is only meant for sunny-day scenarios, 
> not for crashes. It iterates over meta and just [checks for *some value* in the 
> server 
> column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421].
> After a crash that value is still present: it is the server that was just 
> killed, so the check passes before recovery has reassigned anything.
> This bug is probably affecting other fault tolerance tests too, so fixing it 
> may fix more than just one test.
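
To illustrate the gap described above, here is a minimal, self-contained sketch. It is not the attached patch and does not use HBase APIs: the class name, both method names, and the map standing in for the meta server column are hypothetical, chosen only to contrast "some value is present" with "the value names a live server".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: region name -> value of the server column in meta.
class AssignmentCheckSketch {

    // Naive check: a region counts as assigned if meta holds any server value.
    static boolean allHaveServer(Map<String, String> metaServerColumn) {
        for (String server : metaServerColumn.values()) {
            if (server == null || server.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Stricter check (an assumption, not the actual fix): the recorded
    // server must also be in the set of currently live servers.
    static boolean allOnLiveServers(Map<String, String> metaServerColumn,
                                    Set<String> liveServers) {
        for (String server : metaServerColumn.values()) {
            if (server == null || !liveServers.contains(server)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("region-a", "rs1"); // rs1 was just killed; meta still names it
        meta.put("region-b", "rs2");
        Set<String> live = new HashSet<>(Arrays.asList("rs2"));

        System.out.println("naive: " + allHaveServer(meta));           // naive: true
        System.out.println("strict: " + allOnLiveServers(meta, live)); // strict: false
    }
}
```

With the stale entry for the killed server, the naive check reports success immediately, while the stricter one keeps failing until the region is recorded on a live server. A wait loop built on the naive check would therefore return before recovery finishes.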



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
