[
https://issues.apache.org/jira/browse/HBASE-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713132#comment-16713132
]
Andrew Purtell edited comment on HBASE-21565 at 12/7/18 5:53 PM:
-----------------------------------------------------------------
[~tianjingyun] The goal of HBASE-21266 was to fix a different problem, where
numProcessing could get out of sync with the recorded set of processing
servers, and to do so without causing any unit tests to fail.
It wasn't a change that considered all aspects of dead server processing,
including special cases in master initialization. This is a long way of saying
I don't think there is a conflict; the dead server list is serving multiple
overloaded functions. If it is not quite right, we need your proposed changes
too. To your point, I would agree with this:
{quote}Or maybe we should add another barrier for this?
{quote}
I don't think it is strictly necessary, but loading up DeadServers with multiple
semantics makes it hard to maintain and fix.
Also, I work mostly with branch-1, so I'm glad to see Duo is already here. Duo, or
maybe stack, someone more familiar with AMv2, should have a look. Thanks.
> Delete dead server from dead server list too early leads to concurrent Server
> Crash Procedures(SCP) for a same server
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21565
> URL: https://issues.apache.org/jira/browse/HBASE-21565
> Project: HBase
> Issue Type: Bug
> Reporter: Jingyun Tian
> Assignee: Jingyun Tian
> Priority: Critical
> Attachments: HBASE-21565.master.001.patch
>
>
> During a cluster restart, two kinds of SCP can be scheduled for the same
> server: one triggered by a ZK session timeout, and one triggered when a new
> server reports in and causes the stale one to fail over. The only barrier
> between these two kinds of SCP is a check of whether the server is in the
> dead server list.
> {code}
> if (this.deadservers.isDeadServer(serverName)) {
>   LOG.warn("Expiration called on {} but crash processing already in progress", serverName);
>   return false;
> }
> {code}
> The problem is that when the master finishes initialization, it deletes all
> stale servers from the dead server list. Thus, when the SCP for the ZK
> session timeout comes in, the barrier has already been removed.
> Here are the logs showing how this problem occurs.
> {code}
> 2018-12-07,11:42:37,589 INFO
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=9,
> state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure
> server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false
> 2018-12-07,11:42:58,007 INFO
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=444,
> state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure
> server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false
> {code}
> Here we can see that two SCPs are scheduled for the same server,
> and the first procedure finishes only after the second SCP starts.
> {code}
> 2018-12-07,11:43:08,038 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=9,
> state=SUCCESS, hasLock=false; ServerCrashProcedure
> server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false
> in 30.5340sec
> {code}
> This leads to regions being assigned twice.
> {code}
> 2018-12-07,12:16:33,039 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: rit=OPEN,
> location=c4-hadoop-tst-st28.bj,29100,1544154149607, table=test_failover,
> region=459b3130b40caf3b8f3e1421766f4089 reported OPEN on
> server=c4-hadoop-tst-st29.bj,29100,1544154149615 but state has otherwise
> {code}
> And here we can see that the server is removed from the dead server list
> before the second SCP starts.
> {code}
> 2018-12-07,11:42:44,938 DEBUG org.apache.hadoop.hbase.master.DeadServer:
> Removed c4-hadoop-tst-st27.bj,29100,1544153846859 ; numProcessing=3
> {code}
> Thus we should not delete a dead server from the dead server list immediately.
> A patch to fix this problem will be uploaded later.
>
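The ordering described in the report can be reproduced with a small toy model. This is only a sketch under stated assumptions: the class and method names (`DeadServerBarrierSketch`, `expireServer`, `removeStaleDeadServers`) are hypothetical and are not HBase APIs, and the model replays the events sequentially rather than with real threads. It is meant to show why a dead server set that doubles as the only barrier stops working once the end-of-initialization cleanup empties it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical names, not the real HBase ServerManager/DeadServer classes.
class DeadServerBarrierSketch {
    private final Set<String> deadServers = new HashSet<>();
    private final List<String> scheduledScps = new ArrayList<>();
    private int nextPid = 1;

    // Mirrors the barrier quoted in the report: refuse to schedule a second
    // crash procedure while the server is still in the dead server set.
    boolean expireServer(String serverName, String reason) {
        if (deadServers.contains(serverName)) {
            return false; // crash processing already in progress
        }
        deadServers.add(serverName);
        scheduledScps.add("pid=" + (nextPid++) + " server=" + serverName + " reason=" + reason);
        return true;
    }

    // Models the cleanup at the end of master initialization (assumed here to
    // clear every entry, which is a simplification).
    void removeStaleDeadServers() {
        deadServers.clear();
    }

    List<String> scheduled() {
        return scheduledScps;
    }

    public static void main(String[] args) {
        String server = "c4-hadoop-tst-st27.bj,29100,1544153846859";
        DeadServerBarrierSketch master = new DeadServerBarrierSketch();

        // 1. A new server reports in, so the stale one is expired.
        boolean first = master.expireServer(server, "new-server-report");
        // 2. Master finishes initialization and removes the entry too early.
        master.removeStaleDeadServers();
        // 3. The ZK session timeout arrives; the barrier is gone.
        boolean second = master.expireServer(server, "zk-session-timeout");

        System.out.println("first=" + first + " second=" + second
            + " scheduled=" + master.scheduled().size());
    }
}
```

In this model, both calls to `expireServer` succeed and two procedures end up scheduled for the same server. If step 2 is skipped, the second call is rejected by the barrier.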
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)