[ 
https://issues.apache.org/jira/browse/HBASE-27383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HBASE-27383:
---------------------------------
    Affects Version/s: 2.5.0
                           (was: 1.6.0)

> Add dead region server to SplitLogManager#deadWorkers set as the first step.
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-27383
>                 URL: https://issues.apache.org/jira/browse/HBASE-27383
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.0, 1.7.2, 2.4.14
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>
> Currently we add a dead region server to the +SplitLogManager#deadWorkers+ set in 
> the SERVER_CRASH_SPLIT_LOGS state. 
> Consider a case where a region server is handling the split log task for the 
> hbase:meta table, and SplitLogManager has exhausted all its retries and won't 
> try any other region server. 
> The region server that is handling the split log task then dies. 
> SplitLogManager has a check for this: if a region server is declared dead 
> and that region server is responsible for a split log task, then we 
> forcefully resubmit the split log task. See the code 
> [here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java#L721-L726].
> But we add a region server to the SplitLogManager#deadWorkers set only in the 
> [SERVER_CRASH_SPLIT_LOGS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L252]
>  state. 
> Before that, ServerCrashProcedure runs the 
> [SERVER_CRASH_GET_REGIONS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L214]
>  state and checks whether the hbase:meta table is online. In this case, the 
> hbase:meta table was not online, and that prevented SplitLogManager from 
> adding this RS to the deadWorkers set. This created a deadlock, and the HBase 
> cluster was completely down for an extended period of time until we failed 
> over the active HMaster. See HBASE-27382 for more details.
> Improvements:
> 1.  We should add a dead region server to the +SplitLogManager#deadWorkers+ 
> set as the first step.
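
The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HBase code: the class ToySplitLogManager, the methods handleDeadWorker, deadWorkerCheck and simulate, and the server names rs1 and rs2 are all hypothetical, and a bounded loop stands in for the procedure waiting indefinitely on hbase:meta.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the deadlock, NOT HBase code. All names are hypothetical.
class DeadWorkerOrderingSketch {

    /** Stand-in for SplitLogManager: a deadWorkers set plus one meta split task. */
    static class ToySplitLogManager {
        final Set<String> deadWorkers = new HashSet<>();
        String taskOwner; // server currently holding the hbase:meta split task

        ToySplitLogManager(String taskOwner) {
            // Assumption: normal retries are already exhausted, so only the
            // dead-worker check below can move the task to another server.
            this.taskOwner = taskOwner;
        }

        void handleDeadWorker(String server) {
            deadWorkers.add(server);
        }

        /** Forcefully resubmits the task if its owner is known to be dead. */
        boolean deadWorkerCheck(String liveServer) {
            if (deadWorkers.contains(taskOwner)) {
                taskOwner = liveServer;
                return true;
            }
            return false;
        }
    }

    /** Returns true if the meta split task is eventually resubmitted. */
    static boolean simulate(boolean registerDeadWorkerFirst) {
        ToySplitLogManager slm = new ToySplitLogManager("rs1");
        boolean metaOnline = false; // meta stays offline until its WAL is split

        if (registerDeadWorkerFirst) {
            slm.handleDeadWorker("rs1"); // proposed: very first step
        }
        for (int attempt = 0; attempt < 3; attempt++) {
            if (slm.deadWorkerCheck("rs2")) {
                metaOnline = true; // rs2 finishes the split, meta can come online
            }
            // Stand-in for SERVER_CRASH_GET_REGIONS: blocked while meta is offline.
            if (metaOnline) {
                // Stand-in for SERVER_CRASH_SPLIT_LOGS: current registration point.
                slm.handleDeadWorker("rs1");
                return true;
            }
        }
        return false; // never got past the meta check: the deadlock
    }

    public static void main(String[] args) {
        System.out.println("late registration resubmitted: " + simulate(false));
        System.out.println("early registration resubmitted: " + simulate(true));
    }
}
```

In this model, registering the dead server only in the later state never happens because that state is never reached, while registering it first lets the forced resubmit break the cycle.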



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
