[
https://issues.apache.org/jira/browse/HBASE-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267768#comment-16267768
]
stack commented on HBASE-19335:
-------------------------------
LGTM. I like the use of a category-based timeout. When do these new lists get
cleared out, or is that not important in a test context?
> Fix waitUntilAllRegionsAssigned
> -------------------------------
>
> Key: HBASE-19335
> URL: https://issues.apache.org/jira/browse/HBASE-19335
> Project: HBase
> Issue Type: Bug
> Reporter: Appy
> Assignee: Appy
> Attachments: HBASE-19335.master.001.patch,
> HBASE-19335.master.002.patch
>
>
> Found while debugging the flaky test TestRegionObserverInterface#testRecovery.
> At its end, the test does the following:
> - Kills the RS
> - Waits for all regions to be assigned
> - Some validation (unrelated)
> - Cleanup: delete table.
> {noformat}
>       cluster.killRegionServer(rs1.getRegionServer().getServerName());
>       Threads.sleep(1000); // Let the kill soak in.
>       util.waitUntilAllRegionsAssigned(tableName);
>       LOG.info("All regions assigned");
>       verifyMethodResult(SimpleRegionObserver.class,
>         new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs",
>             "getCtPreWALRestore",
>             "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" },
>         tableName, new Integer[] { 1, 1, 2, 2, 0, 0 });
>     } finally {
>       util.deleteTable(tableName);
>       table.close();
>     }
>   }
> {noformat}
> However, the test logs showed Assigns overlapping with Unassigns. As a
> result, regions ended up 'stuck in RIT' and the test timed out.
> The Assigns were from the ServerCrashRecovery and the Unassigns were from the
> deleteTable cleanup.
> Which raises the question: why did HBTU.waitUntilAllRegionsAssigned(tableName)
> not wait until recovery was complete?
> Answer: It looks like that function is only meant for sunny-day scenarios,
> not for crashes. It iterates over meta and just
> [checks for *some value* in the server column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421].
> After a crash that value is still present: it is the server that was just
> killed, so the check passes before the regions have been reassigned.
> This bug must be affecting other fault-tolerance tests too, so fixing it will
> hopefully fix more than just one test.
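
To make the difference between the two checks concrete, here is a minimal, self-contained sketch. It does not use the HBase API and is not the attached patch: meta is modelled as a plain map from region name to the value of its server column, and the class name, method names, and the "live servers" set are all hypothetical. It only illustrates why "some value in the server column" is too weak after a region server has been killed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not HBase code: region name -> value of the server column in meta.
class AssignmentCheckSketch {

  // Weak check described in the issue: any non-empty server value counts as "assigned".
  static boolean allHaveSomeServer(Map<String, String> regionToServer) {
    for (String server : regionToServer.values()) {
      if (server == null || server.isEmpty()) {
        return false;
      }
    }
    return true;
  }

  // Stricter check (an assumption about what a fix could look like):
  // the recorded server must also be one of the currently live servers.
  static boolean allOnLiveServers(Map<String, String> regionToServer,
      Set<String> liveServers) {
    for (String server : regionToServer.values()) {
      if (server == null || !liveServers.contains(server)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> meta = new HashMap<>();
    meta.put("region-a", "rs1"); // still points at the killed server
    meta.put("region-b", "rs2");

    Set<String> live = new HashSet<>();
    live.add("rs2"); // rs1 was just killed

    System.out.println("weak check passes:   " + allHaveSomeServer(meta));
    System.out.println("strict check passes: " + allOnLiveServers(meta, live));
  }
}
```

With rs1 killed and meta not yet updated, the weak check reports every region as assigned while the stricter one does not, which matches the premature return from waitUntilAllRegionsAssigned described above.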