[ https://issues.apache.org/jira/browse/HBASE-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Umesh Agashe updated HBASE-18261:
---------------------------------
Comment: was deleted
(was: Hi [~stack], [~yangzhe1991]:
FWICS here is the root cause:
The UT tests ServerCrashProcedure when the RS carrying the meta region crashes.
It also simulates a master crash after executing each step in the procedure.
Initially all RSs are at the same version, i.e. 3.0.0-SNAPSHOT.
HMaster.getRegionServerVersion() returns version 0.0.0 for the dead RS (the one
carrying meta). This makes AssignmentManager.getExcludedServersForSystemTable()
return a non-empty list, which triggers the logic in
AssignmentManager.checkIfShouldMoveSystemRegionAsync(). That in turn submits a
MoveRegionProcedure to move the meta region from the RS with version 0.0.0 to
one of the other RSs with the latest version.
As commented before, this causes a race condition between the scan and
MoveRegionProcedure.
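To illustrate why a single 0.0.0 entry is enough to trigger the move, here is a minimal sketch of a "exclude everything below the highest version" rule. It is a simplified stand-in, not the actual AssignmentManager code: the class name ExcludedServersSketch, the String server names and the compareVersions() helper are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for the exclusion rule; NOT the actual HBase code. */
class ExcludedServersSketch {

  /** Compares dotted versions numerically, ignoring a suffix such as "-SNAPSHOT". */
  static int compareVersions(String a, String b) {
    String[] x = a.split("-")[0].split("\\.");
    String[] y = b.split("-")[0].split("\\.");
    for (int i = 0; i < Math.max(x.length, y.length); i++) {
      int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
      int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
      if (xi != yi) {
        return Integer.compare(xi, yi);
      }
    }
    return 0;
  }

  /** Returns every server whose reported version is below the highest one seen. */
  static List<String> excludedServers(Map<String, String> versionByServer) {
    String highest = "0.0.0";
    for (String version : versionByServer.values()) {
      if (compareVersions(version, highest) > 0) {
        highest = version;
      }
    }
    List<String> excluded = new ArrayList<>();
    for (Map.Entry<String, String> e : versionByServer.entrySet()) {
      if (compareVersions(e.getValue(), highest) < 0) {
        excluded.add(e.getKey());
      }
    }
    return excluded;
  }
}
```

With every server at 3.0.0-SNAPSHOT the result is empty. Once the crashed RS reports 0.0.0 it is the only entry in the result, and under this rule that is what makes the region it carries a candidate for a move.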
AssignmentManager.getExcludedServersForSystemTable() uses
master.getServerManager().getOnlineServersList() so that it only considers
online servers. But on further scrutiny of the code and logs I found that a
server can be online and dead at the same time!
IMO,
* Currently meta is re/assigned from ServerCrashProcedure, then during master
initialization from MasterMetaBootstrap, and then again in
checkIfShouldMoveSystemRegionAsync().
* That means meta re/assignment may be attempted up to 3 times under certain
conditions.
* I am working on HBASE-18261 to have the meta recovery/assignment logic in one
place.
* I think we can pull the changes for assigning meta to the RS with the highest
version number in there.
* As a result, the RS with the highest version number will be considered for
meta region assignment:
# When the RS carrying the meta region crashes
# During master startup
Along with the above changes, we obviously need to fix
ServerManager.isServerOnline() and ServerManager.isServerDead() returning true
at the same time. This could be a result of the test code simulating a crash,
but the class itself should not allow this case (IMHO).
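One way a class can rule this state out is to move a server between the two sets under a single lock. The following is only a sketch of that invariant: ServerStateTrackerSketch and its markOnline()/markDead() methods are hypothetical names and do not reflect how ServerManager is implemented.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker; shows the invariant only, NOT ServerManager itself. */
class ServerStateTrackerSketch {
  private final Set<String> online = new HashSet<>();
  private final Set<String> dead = new HashSet<>();

  /** Registers a server as online and clears any earlier dead marker. */
  synchronized void markOnline(String server) {
    dead.remove(server);
    online.add(server);
  }

  /** Moves a server to the dead set; it stops being online in the same step. */
  synchronized void markDead(String server) {
    online.remove(server);
    dead.add(server);
  }

  synchronized boolean isServerOnline(String server) {
    return online.contains(server);
  }

  synchronized boolean isServerDead(String server) {
    return dead.contains(server);
  }
}
```

Because both sets are only ever changed together inside synchronized methods, no caller can observe a server that is online and dead at once.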
I have the following fix ready (and tested), which fixes the test, but I don't
consider it a long-term fix.
{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
index 046612a..1a2d53b 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
@@ -1760,6 +1760,7 @@ public class AssignmentManager implements ServerListener {
   public List<ServerName> getExcludedServersForSystemTable() {
     List<Pair<ServerName, String>> serverList = master.getServerManager().getOnlineServersList()
         .stream()
+        .filter((s)->!master.getServerManager().isServerDead(s))
         .map((s)->new Pair<>(s, master.getRegionServerVersion(s)))
         .collect(Collectors.toList());
     if (serverList.isEmpty()) {
{code}
[~stack], as you suggested, we can disable the test for now. Once we agree
on a fix, we can enable it again. Let me know your thoughts. Thanks!)
> [AMv2] Create new RecoverMetaProcedure and use it from ServerCrashProcedure
> and HMaster.finishActiveMasterInitialization()
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-18261
> URL: https://issues.apache.org/jira/browse/HBASE-18261
> Project: HBase
> Issue Type: Improvement
> Components: amv2
> Affects Versions: 2.0.0-alpha-1
> Reporter: Umesh Agashe
> Assignee: Umesh Agashe
> Fix For: 2.0.0-alpha-2
>
> Attachments: HBASE-18261.master.001.patch
>
>
> When the unit test
> hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta()
> is enabled and run several times, it fails intermittently. The cause is that
> meta recovery is done in two different places:
> * ServerCrashProcedure.processMeta()
> * HMaster.finishActiveMasterInitialization()
> and it's not coordinated.
> When HMaster.finishActiveMasterInitialization() gets to submit splitMetaLog()
> first, the call from ServerCrashProcedure.processMeta() fails while it is
> running, causing the step to be retried in a loop.
> When ServerCrashProcedure.processMeta() submits splitMetaLog after the
> splitMetaLog from HMaster.finishActiveMasterInitialization() has finished,
> success is returned without doing any work.
> But if ServerCrashProcedure.processMeta() submits a splitMetaLog request and,
> while it's in progress, HMaster.finishActiveMasterInitialization() submits
> one too, the test fails with an exception.
> [~stack] and I discussed a possible solution:
> Create RecoverMetaProcedure and call it where required. The procedure
> framework provides mutual exclusion and requires idempotence, which should
> fix the problem.
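As a rough illustration of the two properties named above, the sketch below guards a recovery step with a lock (mutual exclusion) and a completion flag (idempotence). It does not use the HBase procedure framework: MetaRecoverySketch, recoverMeta() and the counter that stands in for splitMetaLog are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a meta recovery step; NOT the HBase procedure framework. */
class MetaRecoverySketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final AtomicInteger splitCount = new AtomicInteger();
  private boolean recovered = false;

  /** Returns true if this call did the work, false if it had already been done. */
  boolean recoverMeta() {
    lock.lock(); // mutual exclusion: only one caller recovers at a time
    try {
      if (recovered) {
        return false; // idempotence: a repeated call is a harmless no-op
      }
      splitCount.incrementAndGet(); // stands in for splitting the meta log
      recovered = true;
      return true;
    } finally {
      lock.unlock();
    }
  }

  /** Number of times the simulated log split actually ran. */
  int splitCount() {
    return splitCount.get();
  }
}
```

With both callers going through recoverMeta(), whichever arrives second waits for the first to finish and then returns without redoing the split, so neither the retry loop nor the overlapping submission can occur in this model.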