[
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell resolved HBASE-27382.
-----------------------------------------
Assignee: (was: Rushabh Shah)
Resolution: Won't Fix
> Cluster completely down due to WAL splitting failing for hbase:meta table.
> --------------------------------------------------------------------------
>
> Key: HBASE-27382
> URL: https://issues.apache.org/jira/browse/HBASE-27382
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.5.0, 1.7.2, 2.4.14
> Reporter: Rushabh Shah
> Priority: Major
>
> We are running a build based on 1.7.2 in our production environment and
> encountered this issue recently.
> We colocate the NameNode and the region server holding the hbase:meta table
> on a set of 5 master nodes. Coincidentally, the active NameNode and the
> region server holding the meta table were on the same physical node, and
> that node went down due to a hardware issue. We have suboptimal HDFS-level
> timeouts configured, so whenever the active NameNode goes down it takes
> around 12-15 minutes for the HDFS client within HBase to connect to the new
> active NameNode. As a result, all the region servers had trouble connecting
> to the new active NameNode for about 15 minutes.
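> For context, how long this failover window lasts is governed by the HDFS HA
> client settings rather than by HBase itself. A minimal sketch of the
> standard Hadoop HA client knobs involved (the values below are illustrative
> examples only, not our production settings or a recommendation):
> {noformat}
> import org.apache.hadoop.conf.Configuration;
> public class HdfsFailoverTuningSketch {
>     public static void main(String[] args) {
>         Configuration conf = new Configuration();
>         // Upper bound on how many failover attempts the HDFS client makes.
>         conf.setInt("dfs.client.failover.max.attempts", 15);
>         // Exponential backoff between failover attempts, in milliseconds.
>         conf.setInt("dfs.client.failover.sleep.base.millis", 500);
>         conf.setInt("dfs.client.failover.sleep.max.millis", 15000);
>         // Per-connection retry behaviour before a failover is attempted.
>         conf.setInt("ipc.client.connect.timeout", 10000);
>         conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);
>         System.out.println("failover max attempts = "
>             + conf.getInt("dfs.client.failover.max.attempts", -1));
>     }
> }
> {noformat}
> In practice these would be set in hdfs-site.xml / core-site.xml on the
> client side; tightening them shortens the window during which every region
> server is stuck talking to the dead NameNode.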
> Below is the sequence of events:
> 1. The host running the active NameNode and hbase:meta went down at
> +2022-09-09 16:56:56,878+
> 2. The HMaster started a ServerCrashProcedure at +2022-09-09 16:59:05,696+
> {noformat}
> 2022-09-09 16:59:05,696 DEBUG [t-processor-pool2-t1]
> procedure2.ProcedureExecutor - Procedure ServerCrashProcedure
> serverName=<hmase-meta-RS>,61020,1662714013670, shouldSplitWal=true,
> carryingMeta=true id=1 owner=dummy state=RUNNABLE:SERVER_CRASH_START added to
> the store.
> 2022-09-09 16:59:05,702 DEBUG [t-processor-pool2-t1] master.ServerManager -
> Added=<hmase-meta-RS>,61020,1662714013670 to dead servers, submitted shutdown
> handler to be executed meta=true
> 2022-09-09 16:59:05,707 DEBUG [ProcedureExecutor-0] master.DeadServer -
> Started processing <hmase-meta-RS>,61020,1662714013670; numProcessing=1
> 2022-09-09 16:59:05,712 INFO [ProcedureExecutor-0]
> procedure.ServerCrashProcedure - Start processing crashed
> <hmase-meta-RS>,61020,1662714013670
> {noformat}
> 3. SplitLogManager created 2 split log tasks in ZooKeeper:
> {noformat}
> 2022-09-09 16:59:06,049 INFO [ProcedureExecutor-1] master.SplitLogManager -
> Started splitting 2 logs in
> [hdfs://<cluster-name>/hbase/WALs/<hmase-meta-RS>,61020,1662714013670-splitting]
> for [<hmase-meta-RS>,61020,1662714013670]
> 2022-09-09 16:59:06,081 DEBUG [main-EventThread]
> coordination.SplitLogManagerCoordination - put up splitlog task at znode
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 16:59:06,093 DEBUG [main-EventThread]
> coordination.SplitLogManagerCoordination - put up splitlog task at znode
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662739251611.meta
> {noformat}
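> (Aside: the znode names above are simply the URL-encoded relative paths of
> the WAL files. The meta WAL file name already carries percent-encoded
> commas on HDFS, so it is effectively encoded a second time when the flat
> znode name is built, which is why both %2C and %252C appear. A minimal
> illustration of that encoding, assuming plain java.net.URLEncoder semantics
> rather than HBase's actual helper, and using a made-up host name:)
> {noformat}
> import java.net.URLEncoder;
> import java.nio.charset.StandardCharsets;
> public class SplitWalZnodeNameSketch {
>     public static void main(String[] args) throws Exception {
>         // Relative WAL path as it exists on HDFS: the -splitting directory
>         // keeps the literal commas of the ServerName, while the file name
>         // already contains %2C in place of commas.
>         String walPath = "WALs/host-1,61020,1662714013670-splitting/"
>             + "host-1%2C61020%2C1662714013670.meta.1662735651285.meta";
>         // Encoding the whole relative path into one flat znode name turns
>         // '/' into %2F, ',' into %2C and the pre-existing '%' into %25,
>         // hence the %252C sequences in the master log.
>         String znode = URLEncoder.encode(walPath, StandardCharsets.UTF_8.name());
>         System.out.println("/hbase/splitWAL/" + znode);
>     }
> }
> {noformat}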
> 4. The first split log task is the more interesting of the two:
> +/hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta+
> 5. Since all the region servers were having problems connecting to the
> active NameNode, SplitLogManager tried a total of 4 times to assign this
> task (1 initial submission plus 3 resubmits, governed by
> hbase.splitlog.max.resubmit) and then finally gave up; see the
> configuration sketch after the log excerpt below.
> {noformat}
> ---------- try 1 ---------------------
> 2022-09-09 16:59:06,205 INFO [main-EventThread]
> coordination.SplitLogManagerCoordination - task
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> acquired by <RS-1>,61020,1662540522069
> ---------- try 2 ---------------------
> 2022-09-09 17:01:06,642 INFO [ager__ChoreService_1]
> coordination.SplitLogManagerCoordination - resubmitting task
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:01:06,666 DEBUG [main-EventThread]
> coordination.SplitLogManagerCoordination - task not yet acquired
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> ver = 2
> 2022-09-09 17:01:06,715 INFO [main-EventThread]
> coordination.SplitLogManagerCoordination - task
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> acquired by <RS-2>,61020,1662530684713
> ---------- try 3 ---------------------
> 2022-09-09 17:03:07,643 INFO [ager__ChoreService_1]
> coordination.SplitLogManagerCoordination - resubmitting task
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:03:07,687 DEBUG [main-EventThread]
> coordination.SplitLogManagerCoordination - task not yet acquired
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> ver = 4
> 2022-09-09 17:03:07,738 INFO [main-EventThread]
> coordination.SplitLogManagerCoordination - task
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> acquired by <RS-3>,61020,1662542355806
> ---------- try 4 ---------------------
> 2022-09-09 17:05:08,684 INFO [ager__ChoreService_1]
> coordination.SplitLogManagerCoordination - resubmitting task
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:05:08,717 DEBUG [main-EventThread]
> coordination.SplitLogManagerCoordination - task not yet acquired
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> ver = 6
> 2022-09-09 17:05:08,757 INFO [main-EventThread]
> coordination.SplitLogManagerCoordination - task
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> acquired by <RS-4>,61020,1662543537599
> ------- after all retries exhausted ---------
> 2022-09-09 17:07:09,642 INFO [ager__ChoreService_1]
> coordination.SplitLogManagerCoordination - Skipping resubmissions of task
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> because threshold 3 reached
> {noformat}
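> The resubmit accounting above comes from hbase.splitlog.max.resubmit
> (threshold 3 in the log, i.e. one initial submission plus three resubmits),
> after which the SplitLogManager stops handing the task out. A minimal
> sketch of how that limit could be raised via configuration (the value 10 is
> only an example; in practice this would be set in hbase-site.xml):
> {noformat}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> public class SplitLogResubmitLimitSketch {
>     public static void main(String[] args) {
>         // HBaseConfiguration.create() layers hbase-site.xml over the Hadoop
>         // defaults; setting the property here mirrors a site-level override.
>         Configuration conf = HBaseConfiguration.create();
>         int current = conf.getInt("hbase.splitlog.max.resubmit", 3);
>         conf.setInt("hbase.splitlog.max.resubmit", 10); // example value only
>         System.out.println("resubmit limit: " + current + " -> "
>             + conf.getInt("hbase.splitlog.max.resubmit", 3));
>     }
> }
> {noformat}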
> 6. After exhausting all the retries, the HMaster waits indefinitely for
> <RS-4> to complete the splitting. In this case the last assigned region
> server (<RS-4>) also died, so the HMaster could not assign the hbase:meta
> table to any other region server.
> 7. We failed over the active HMaster, and the new active HMaster was then
> able to successfully assign hbase:meta to another region server.
> Looking at the logs of the last region server (<RS-4>): it acquired the log
> splitting task at +2022-09-09 17:05:08,732+ and aborted at +2022-09-09
> 17:10:12,151+. The SplitLogWorker thread was terminated as part of the
> abort process.
> {noformat}
> 2022-09-09 17:05:08,732 INFO [-<RS-4>:61020]
> coordination.ZkSplitLogWorkerCoordination - worker <RS-4>,61020,1662543537599
> acquired task
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:10:12,151 FATAL [,queue=21,port=61020]
> regionserver.HRegionServer - ABORTING region server
> <RS-4>,61020,1662543537599: WAL sync failed, aborting to preserve WAL as
> source of truth
> 2022-09-09 17:10:12,185 INFO [-<RS-4>:61020] regionserver.SplitLogWorker -
> SplitLogWorker interrupted. Exiting.
> {noformat}
> +Possible Improvements+
> 1. Can we remove the maximum-resubmit limit (hbase.splitlog.max.resubmit),
> either only when the WAL being split belongs to the hbase:meta table or for
> all tables? A rough sketch of this idea follows below.
> 2. If the HMaster is not able to get a split WAL task completed after
> hbase.splitlog.max.resubmit resubmits, should it just abort itself?
> Any other suggestions are welcome.
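> A rough, purely illustrative sketch of idea 1 above. This is a hypothetical
> standalone helper, not the actual SplitLogManagerCoordination code; the
> class, method and host names are made up for illustration, and it only
> relies on the fact (visible in the logs above) that meta WAL file names
> carry a ".meta." component:
> {noformat}
> public class MetaAwareResubmitPolicySketch {
>     // Hypothetical decision helper: keep resubmitting hbase:meta
>     // WAL-splitting tasks indefinitely, but keep honouring
>     // hbase.splitlog.max.resubmit for all other WALs.
>     static boolean shouldResubmit(String walTaskPath, int resubmits,
>             int maxResubmit) {
>         boolean isMetaWal = walTaskPath.contains(".meta.");
>         if (isMetaWal) {
>             return true; // never give up on hbase:meta (improvement 1)
>         }
>         return resubmits < maxResubmit;
>     }
>     public static void main(String[] args) {
>         String metaTask = "WALs/host-1,61020,1662714013670-splitting/"
>             + "host-1%2C61020%2C1662714013670.meta.1662735651285.meta";
>         String userTask = "WALs/host-2,61020,1662714013670-splitting/"
>             + "host-2%2C61020%2C1662714013670.1662735651285";
>         System.out.println(shouldResubmit(metaTask, 3, 3)); // true
>         System.out.println(shouldResubmit(userTask, 3, 3)); // false
>     }
> }
> {noformat}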
--
This message was sent by Atlassian Jira
(v8.20.10#820010)