Appy created HBASE-19335:
----------------------------
Summary: Fix waitUntilAllRegionsAssigned
Key: HBASE-19335
URL: https://issues.apache.org/jira/browse/HBASE-19335
Project: HBase
Issue Type: Bug
Reporter: Appy
Assignee: Appy
Found while debugging the flaky test TestRegionObserverInterface#testRecovery.
Toward its end, the test does the following:
- Kills the RS
- Waits for all regions to be assigned
- Some validation (unrelated)
- Cleanup: delete table.
{noformat}
    cluster.killRegionServer(rs1.getRegionServer().getServerName());
    Threads.sleep(1000); // Let the kill soak in.
    util.waitUntilAllRegionsAssigned(tableName);
    LOG.info("All regions assigned");
    verifyMethodResult(SimpleRegionObserver.class,
        new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs",
            "getCtPreWALRestore",
            "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" },
        tableName, new Integer[] { 1, 1, 2, 2, 0, 0 });
  } finally {
    util.deleteTable(tableName);
    table.close();
  }
}
{noformat}
However, the test logs showed Assigns overlapping with Unassigns. As a result,
regions ended up 'stuck in RIT' (region in transition) and the test timed out.
The Assigns came from the ServerCrashRecovery and the Unassigns came from the
deleteTable cleanup.
This raises the question: why did HBTU.waitUntilAllRegionsAssigned(tableName)
not wait until recovery was complete?
Answer: It looks like that function is only meant for sunny-day scenarios, not
for crashes. It iterates over meta and just [checks for *some value* in the server
column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421].
After a crash that value is still present: it names the server that was just
killed, so the check passes before recovery has reassigned anything.
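To illustrate the gap, here is a minimal, self-contained sketch. It is not HBase code: the class AssignmentCheckSketch, its two methods, and the map-based model of meta (region name to the value of its server column) are all hypothetical, and the stricter check is only one assumption about what a fix could look like.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical model, not HBase code: "meta" maps a region name to the
// server name recorded in that region's server column.
final class AssignmentCheckSketch {

  // Mirrors the check described above: any non-empty server value counts
  // as "assigned", even if that server is dead.
  static boolean allRegionsHaveSomeServer(Map<String, String> meta) {
    for (String server : meta.values()) {
      if (server == null || server.isEmpty()) {
        return false;
      }
    }
    return true;
  }

  // Stricter variant (an assumption, not the actual fix): the recorded
  // server must also be in the set of currently live servers.
  static boolean allRegionsOnLiveServers(Map<String, String> meta,
      Set<String> liveServers) {
    for (String server : meta.values()) {
      if (server == null || !liveServers.contains(server)) {
        return false;
      }
    }
    return true;
  }
}
```

With a region whose server column still names the killed server, the first check returns true while the second returns false, which matches the premature "All regions assigned" seen in the test.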
This bug is likely affecting other fault-tolerance tests too, so fixing it may
fix more than just this one test.