[ 
https://issues.apache.org/jira/browse/HBASE-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267825#comment-16267825
 ] 

Appy commented on HBASE-19335:
------------------------------

We can't clear the list.
----
Chia-Ping had the same question on RB. Here's a copy for quick ref:
Should we remove the rs from dead list when test try to restart the rs? IIRC, 
the IT has the action which would kill-and-start RS.

Answer:
That's a nice question! Keep them coming.
I considered the case and there is one reason why we can't do that, and one 
more reason why it's not needed.
- We track dead RS to ignore it's presence in meta table. We can't remove a 
dead RS from list because even if the RS hase been restarted, it doesn't mean 
the regions that were on it have been reassigned. Maybe they are still 
undergoing log splitting.

- We don't need to "clean up" the list if a RS has been restarted because 
ServerName (domain + port + startcode) will be different for the new instance 
i.e. a dead ServerName will always remain dead. Another thing could be memory 
used by list, but i suspect it'll ever grow too big to be any issue, and it's 
tests, so it's fine.

Let me mention some of this as comment in the code so others don't trip on same 
thing.

> Fix waitUntilAllRegionsAssigned
> -------------------------------
>
>                 Key: HBASE-19335
>                 URL: https://issues.apache.org/jira/browse/HBASE-19335
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Appy
>            Assignee: Appy
>         Attachments: HBASE-19335.master.001.patch, 
> HBASE-19335.master.002.patch
>
>
> Found when debugging flaky test TestRegionObserverInterface#testRecovery.
> In the end, the test does the following:
> - Kills the RS
> - Waits for all regions to be assigned
> - Some validation (unrelated)
> - Cleanup: delete table.
> {noformat}
>       cluster.killRegionServer(rs1.getRegionServer().getServerName());
>       Threads.sleep(1000); // Let the kill soak in.
>       util.waitUntilAllRegionsAssigned(tableName);
>       LOG.info("All regions assigned");
>       verifyMethodResult(SimpleRegionObserver.class,
>         new String[] { "getCtPreReplayWALs", "getCtPostReplayWALs", 
> "getCtPreWALRestore",
>             "getCtPostWALRestore", "getCtPrePut", "getCtPostPut" },
>         tableName, new Integer[] { 1, 1, 2, 2, 0, 0 });
>     } finally {
>       util.deleteTable(tableName);
>       table.close();
>     }
>   }
> {noformat}
> However, looking at test logs, found that we had overlapping Assigns with 
> Unassigns. As a result, regions ended up 'stuck in RIT' and the test timeout.
> Assigns were from the ServerCrashRecovery and Unassigns were from the 
> deleteTable cleanup.
> Which begs the question, why did HBTU.waitUntilAllRegionsAssigned(tableName) 
> not wait until recovery was complete.
> Answer: Looks like that function is only meant for sunny scenarios but not 
> for crashes. It iterates over meta and just [checks for *some value* in the 
> server 
> column|https://github.com/apache/hbase/blob/cdc2bb17ff38dcbd273cf501aea565006e995a06/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java#L3421]
>  which is obviously present and equal to the server that was just killed.
> This bug must be affecting other fault tolerance tests too and fixing it may 
> fix more than just one test, hopefully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to