[ https://issues.apache.org/jira/browse/HBASE-27383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rushabh Shah reassigned HBASE-27383:
------------------------------------

    Assignee:     (was: Rushabh Shah)

> Add dead region server to SplitLogManager#deadWorkers set as the first step.
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-27383
>                 URL: https://issues.apache.org/jira/browse/HBASE-27383
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.0, 1.7.2, 2.4.14
>            Reporter: Rushabh Shah
>            Priority: Major
>
> Currently we add a dead region server to the +SplitLogManager#deadWorkers+
> set in the SERVER_CRASH_SPLIT_LOGS state.
> Consider a case where a region server is handling the split log task for the
> hbase:meta table, and SplitLogManager has exhausted all its retries and will
> not try any more region servers.
> The region server that is handling the split log task has died.
> We have a check in SplitLogManager: if a region server is declared dead and
> that region server is responsible for a split log task, then we forcefully
> resubmit the split log task. See the code
> [here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java#L721-L726].
> But we add a region server to the SplitLogManager#deadWorkers set in the
> [SERVER_CRASH_SPLIT_LOGS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L252]
> state.
> Before that, the procedure runs the
> [SERVER_CRASH_GET_REGIONS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L214]
> state and checks whether the hbase:meta table is up. In this case, the
> hbase:meta table was not online, so the procedure never reached the state in
> which SplitLogManager adds this RS to the deadWorkers set. This created a
> deadlock, and the HBase cluster was completely down for an extended period of
> time until we failed over the active HMaster. See HBASE-27382 for more
> details.
>
> Improvements:
> 1. We should add a dead region server to the +SplitLogManager#deadWorkers+
> set as the first step.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
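
The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the HBase implementation: the class DeadWorkerOrderingSketch, the ToySplitLogManager type, the MARK_DEAD_WORKER state, and the run method are all hypothetical names invented for illustration. The model assumes that a resubmitted hbase:meta split task always succeeds on a live server.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the state ordering in the report; NOT the real HBase code. */
public class DeadWorkerOrderingSketch {

    // Hypothetical, simplified states. MARK_DEAD_WORKER models the proposed first step.
    enum State { MARK_DEAD_WORKER, GET_REGIONS, SPLIT_LOGS, DONE }

    /** Hypothetical stand-in for SplitLogManager's dead-worker bookkeeping. */
    static final class ToySplitLogManager {
        final Set<String> deadWorkers = new HashSet<>();
        String metaTaskOwner;      // server holding the hbase:meta split task
        boolean metaLogSplit;      // true once the meta log has been split

        // Models the described check: a task owned by a dead worker is resubmitted.
        void checkTasks() {
            if (metaTaskOwner != null && deadWorkers.contains(metaTaskOwner)) {
                metaTaskOwner = null;
                metaLogSplit = true; // assumption: the resubmitted task succeeds
            }
        }

        boolean isMetaOnline() {
            return metaLogSplit;
        }
    }

    /** Runs the toy procedure for at most maxSteps and returns the state reached. */
    static State run(ToySplitLogManager slm, String deadServer,
                     boolean markFirst, int maxSteps) {
        State state = markFirst ? State.MARK_DEAD_WORKER : State.GET_REGIONS;
        for (int i = 0; i < maxSteps && state != State.DONE; i++) {
            switch (state) {
                case MARK_DEAD_WORKER:
                    slm.deadWorkers.add(deadServer);
                    state = State.GET_REGIONS;
                    break;
                case GET_REGIONS:
                    slm.checkTasks();
                    // Stay in this state (retry) until hbase:meta is online.
                    if (slm.isMetaOnline()) {
                        state = State.SPLIT_LOGS;
                    }
                    break;
                case SPLIT_LOGS:
                    // In the current ordering this is the first place the server is marked dead.
                    slm.deadWorkers.add(deadServer);
                    slm.checkTasks();
                    state = State.DONE;
                    break;
                default:
                    break;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        ToySplitLogManager current = new ToySplitLogManager();
        current.metaTaskOwner = "rs1";
        System.out.println("current order:  " + run(current, "rs1", false, 10));

        ToySplitLogManager proposed = new ToySplitLogManager();
        proposed.metaTaskOwner = "rs1";
        System.out.println("proposed order: " + run(proposed, "rs1", true, 10));
    }
}
```

In this model the current ordering never leaves GET_REGIONS, because the only step that would unblock hbase:meta sits behind the check that waits for hbase:meta. Marking the server dead first lets the resubmission check fire before that wait.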