[ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tak-Lon (Stephen) Wu updated HBASE-27382:
-----------------------------------------
    Fix Version/s: 2.4.18
                       (was: 2.4.17)

> Cluster completely down due to WAL splitting failing for hbase:meta table.
> --------------------------------------------------------------------------
>
>                 Key: HBASE-27382
>                 URL: https://issues.apache.org/jira/browse/HBASE-27382
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.0, 1.7.2, 2.4.14
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.5, 2.4.18
>
>
> We are running a version based on 1.7.2 in our production environment and 
> encountered this issue recently.
> We colocate the namenode and the region server holding the hbase:meta table on 
> a set of 5 master nodes. Coincidentally, the active namenode and the region 
> server holding the meta table were on the same physical node, and that node 
> went down due to a hardware issue. We have suboptimal HDFS-level timeouts 
> configured, so whenever the active namenode goes down it takes around 12-15 
> minutes for the HDFS client within HBase to connect to the new active 
> namenode. As a result, all the region servers had trouble connecting to the 
> new active namenode for about 15 minutes.
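> For reference, the client-side NameNode failover behavior is governed by HDFS 
> settings along the lines of the sketch below. The property names are standard 
> HDFS HA client settings; the values shown are illustrative defaults, not our 
> production configuration:
> {noformat}
> <!-- hdfs-site.xml / core-site.xml (client side) - illustrative values only -->
> <property>
>   <name>dfs.client.failover.max.attempts</name>
>   <value>15</value>
> </property>
> <property>
>   <name>dfs.client.failover.sleep.base.millis</name>
>   <value>500</value>
> </property>
> <property>
>   <name>dfs.client.failover.sleep.max.millis</name>
>   <value>15000</value>
> </property>
> <!-- RPC connect timeout/retries also affect how long a client keeps retrying
>      the dead namenode before it fails over -->
> <property>
>   <name>ipc.client.connect.timeout</name>
>   <value>20000</value>
> </property>
> <property>
>   <name>ipc.client.connect.max.retries.on.timeouts</name>
>   <value>45</value>
> </property>
> {noformat}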
> Below is the sequence of events:
> 1. The host running the active namenode and the hbase:meta region server went 
> down at +2022-09-09 16:56:56,878+.
> 2. HMaster started running a ServerCrashProcedure at +2022-09-09 16:59:05,696+:
> {noformat}
> 2022-09-09 16:59:05,696 DEBUG [t-processor-pool2-t1] 
> procedure2.ProcedureExecutor - Procedure ServerCrashProcedure 
> serverName=<hmase-meta-RS>,61020,1662714013670, shouldSplitWal=true, 
> carryingMeta=true id=1 owner=dummy state=RUNNABLE:SERVER_CRASH_START added to 
> the store.
> 2022-09-09 16:59:05,702 DEBUG [t-processor-pool2-t1] master.ServerManager - 
> Added=<hmase-meta-RS>,61020,1662714013670 to dead servers, submitted shutdown 
> handler to be executed meta=true
> 2022-09-09 16:59:05,707 DEBUG [ProcedureExecutor-0] master.DeadServer - 
> Started processing <hmase-meta-RS>,61020,1662714013670; numProcessing=1
> 2022-09-09 16:59:05,712 INFO  [ProcedureExecutor-0] 
> procedure.ServerCrashProcedure - Start processing crashed 
> <hmase-meta-RS>,61020,1662714013670
> {noformat}
> 3. SplitLogManager created 2 split log tasks in ZooKeeper:
> {noformat}
> 2022-09-09 16:59:06,049 INFO  [ProcedureExecutor-1] master.SplitLogManager - 
> Started splitting 2 logs in 
> [hdfs://<cluster-name>/hbase/WALs/<hmase-meta-RS>,61020,1662714013670-splitting]
>  for [<hmase-meta-RS>,61020,1662714013670]
> 2022-09-09 16:59:06,081 DEBUG [main-EventThread] 
> coordination.SplitLogManagerCoordination - put up splitlog task at znode 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 16:59:06,093 DEBUG [main-EventThread] 
> coordination.SplitLogManagerCoordination - put up splitlog task at znode 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662739251611.meta
> {noformat}
> 4. The first split log task is the more interesting one: 
> +/hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta+
> 5. Since all the region servers were having problems connecting to the active 
> namenode, SplitLogManager tried a total of 4 times to assign this task (3 
> resubmits, configured by hbase.splitlog.max.resubmit) and then finally gave 
> up.
> {noformat}
> ---------- try 1 ---------------------
> 2022-09-09 16:59:06,205 INFO  [main-EventThread] 
> coordination.SplitLogManagerCoordination - task 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
>  acquired by <RS-1>,61020,1662540522069
> ---------- try 2 ---------------------
> 2022-09-09 17:01:06,642 INFO  [ager__ChoreService_1] 
> coordination.SplitLogManagerCoordination - resubmitting task 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:01:06,666 DEBUG [main-EventThread] 
> coordination.SplitLogManagerCoordination - task not yet acquired 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
>  ver = 2
> 2022-09-09 17:01:06,715 INFO  [main-EventThread] 
> coordination.SplitLogManagerCoordination - task 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
>  acquired by <RS-2>,61020,1662530684713
> ---------- try 3 ---------------------
> 2022-09-09 17:03:07,643 INFO  [ager__ChoreService_1] 
> coordination.SplitLogManagerCoordination - resubmitting task 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:03:07,687 DEBUG [main-EventThread] 
> coordination.SplitLogManagerCoordination - task not yet acquired 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
>  ver = 4
> 2022-09-09 17:03:07,738 INFO  [main-EventThread] 
> coordination.SplitLogManagerCoordination - task 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
>  acquired by <RS-3>,61020,1662542355806
> ---------- try 4 ---------------------
> 2022-09-09 17:05:08,684 INFO  [ager__ChoreService_1] 
> coordination.SplitLogManagerCoordination - resubmitting task 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:05:08,717 DEBUG [main-EventThread] 
> coordination.SplitLogManagerCoordination - task not yet acquired 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
>  ver = 6
> 2022-09-09 17:05:08,757 INFO  [main-EventThread] 
> coordination.SplitLogManagerCoordination - task 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
>  acquired by <RS-4>,61020,1662543537599
> ------- after all retries exhausted ---------
> 2022-09-09 17:07:09,642 INFO  [ager__ChoreService_1] 
> coordination.SplitLogManagerCoordination - Skipping resubmissions of task 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
>  because threshold 3 reached
> {noformat}
> 6. After it exhausts all the retries, HMaster will wait indefinitely for 
> <RS-4> to complete the split. In this case that last region server died as 
> well, so HMaster couldn't assign the hbase:meta table to any other region 
> server.
> 7. We failed over the active HMaster, and the new active HMaster was then able 
> to successfully assign hbase:meta to another region server.
> Looking at the logs of the last region server (<RS-4>): it acquired the split 
> log task at +2022-09-09 17:05:08,732+ and aborted at +2022-09-09 17:10:12,151+. 
> The SplitLogWorker thread was terminated as part of that abort.
> {noformat}
> 2022-09-09 17:05:08,732 INFO  [-<RS-4>:61020] 
> coordination.ZkSplitLogWorkerCoordination - worker <RS-4>,61020,1662543537599 
> acquired task 
> /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:10:12,151 FATAL [,queue=21,port=61020] 
> regionserver.HRegionServer - ABORTING region server 
> <RS-4>,61020,1662543537599: WAL sync failed, aborting to preserve WAL as 
> source of truth
> 2022-09-09 17:10:12,185 INFO  [-<RS-4>:61020] regionserver.SplitLogWorker - 
> SplitLogWorker interrupted. Exiting. 
> {noformat}
> +Possible Improvements+
> 1. Can we remove the maximum-retries limit (hbase.splitlog.max.resubmit), 
> either only when the table is hbase:meta or for all tables? (See the 
> configuration sketch after this list for how the limit is set today.)
> 2. If HMaster is not able to assign split WAL tasks after 
> hbase.splitlog.max.resubmit retries, should it just abort itself?
> Any other suggestions are welcome.
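> For context on improvement 1, hbase.splitlog.max.resubmit is an hbase-site.xml 
> setting read by the master's split log coordination. The snippet below is only 
> an illustrative sketch of raising the limit (the default, as the "threshold 3 
> reached" log line above shows, is 3); it is not a proposed fix:
> {noformat}
> <!-- hbase-site.xml on the HMaster - illustrative sketch only -->
> <property>
>   <!-- how many times SplitLogManager re-queues an unfinished split WAL task
>        to another worker before giving up (default 3) -->
>   <name>hbase.splitlog.max.resubmit</name>
>   <value>10</value>
> </property>
> {noformat}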



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
