Jingyun Tian created HBASE-21565:
------------------------------------

             Summary: Deleting a dead server from the dead server list too early
leads to concurrent Server Crash Procedures (SCP) for the same server
                 Key: HBASE-21565
                 URL: https://issues.apache.org/jira/browse/HBASE-21565
             Project: HBase
          Issue Type: Bug
            Reporter: Jingyun Tian
            Assignee: Jingyun Tian


During a cluster restart, two kinds of SCP can be scheduled for the same server: 
one is triggered by the ZK session timeout, and the other by a new server 
reporting in, which causes the stale instance to be failed over. The only 
barrier between these two SCPs is a check of whether the server is already in 
the dead server list.
{code}
    // The only guard against scheduling a second SCP for the same server:
    if (this.deadservers.isDeadServer(serverName)) {
      LOG.warn("Expiration called on {} but crash processing already in progress",
        serverName);
      return false;
    }
{code}
But the problem is that when the master finishes initialization, it deletes all 
stale servers from the dead server list. Thus, when the SCP for the ZK session 
timeout comes in, the barrier has already been removed.
Here are the logs showing how this problem occurs.
{code}
2018-12-07,11:42:37,589 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=9, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false
2018-12-07,11:42:58,007 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=444, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false
{code}
Now we can see that two SCPs were scheduled for the same server.
Moreover, the first procedure finished only after the second SCP had started.
{code}
2018-12-07,11:43:08,038 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=9, state=SUCCESS, hasLock=false; ServerCrashProcedure server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false in 30.5340sec
{code}
This leads to the problem that regions are assigned twice.
{code}
2018-12-07,12:16:33,039 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: rit=OPEN, location=c4-hadoop-tst-st28.bj,29100,1544154149607, table=test_failover, region=459b3130b40caf3b8f3e1421766f4089 reported OPEN on server=c4-hadoop-tst-st29.bj,29100,1544154149615 but state has otherwise
{code}
And here we can see that the server was removed from the dead server list 
before the second SCP started.
{code}
2018-12-07,11:42:44,938 DEBUG org.apache.hadoop.hbase.master.DeadServer: Removed c4-hadoop-tst-st27.bj,29100,1544153846859 ; numProcessing=3
{code}
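To make the interleaving concrete, here is a minimal, self-contained sketch 
(plain Java, not HBase code; {{ScpRaceDemo}}, {{expireServer}} and 
{{removeStaleServers}} are hypothetical stand-ins for ServerManager and 
DeadServer) showing how clearing the list between the two expirations defeats 
the guard:
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the master-side state, for illustration only.
public class ScpRaceDemo {
  private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

  // Mirrors the guard shown above: refuse a second SCP while one is in progress.
  boolean expireServer(String serverName) {
    if (deadServers.contains(serverName)) {
      System.out.println("Expiration called on " + serverName
          + " but crash processing already in progress");
      return false;
    }
    deadServers.add(serverName);
    System.out.println("Scheduled SCP for " + serverName);
    return true;
  }

  // Mirrors the premature cleanup when the master finishes initialization.
  void removeStaleServers() {
    deadServers.clear();
    System.out.println("Removed stale servers from dead server list");
  }

  public static void main(String[] args) {
    ScpRaceDemo master = new ScpRaceDemo();
    String server = "c4-hadoop-tst-st27.bj,29100,1544153846859";

    master.expireServer(server);   // SCP #1 (new server reported in)
    master.removeStaleServers();   // master init clears the list too early
    master.expireServer(server);   // SCP #2 (ZK session timeout) is NOT blocked
  }
}
{code}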

Thus we should not delete a dead server from the dead server list immediately.
A patch to fix this problem will be uploaded later.
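As one possible shape for the fix (a minimal sketch only, assuming removal is 
deferred until crash processing completes; whether the actual patch works this 
way is an assumption), removal could be gated on a per-server processing count 
like the {{numProcessing}} value seen in the DEBUG log above:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: keep a server in the dead list until every SCP
// touching it has finished, instead of clearing the list after master init.
public class DeferredDeadServerList {
  private final Map<String, Integer> numProcessing = new ConcurrentHashMap<>();

  synchronized boolean isDeadServer(String serverName) {
    return numProcessing.containsKey(serverName);
  }

  // Called when an SCP is scheduled for the server.
  synchronized void add(String serverName) {
    numProcessing.merge(serverName, 1, Integer::sum);
  }

  // Called when an SCP finishes; only then may the entry disappear.
  synchronized void finish(String serverName) {
    Integer remaining = numProcessing.computeIfPresent(serverName, (k, v) -> v - 1);
    if (remaining != null && remaining <= 0) {
      numProcessing.remove(serverName);
      System.out.println("Removed " + serverName + " ; numProcessing=0");
    }
  }
}
{code}
The point is only that removal must be tied to SCP completion rather than to 
master initialization.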

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
