[
https://issues.apache.org/jira/browse/HBASE-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419556#comment-17419556
]
Anoop Sam John commented on HBASE-26287:
----------------------------------------
Why is the SSH for the old RS (the one that hosted the NS region) not kicking
in? Did you lose the MasterProcWAL? I believe there is an hbck2 option to assign
the NS region in such a stuck state. IMO we should use such tools in this case.
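
As a rough illustration of the hbck2 route, the sketch below pulls the encoded region name and the stale server out of the isRegionOnline warning quoted in the issue and builds (without running) an HBCK2 `assigns` command line. The regular expression is derived only from the log lines in this thread, the jar path is a placeholder, and the function names are hypothetical; whether `assigns` alone is enough depends on the cluster state, so check the HBCK2 documentation first.

```python
import re

# Pattern for the HMaster.isRegionOnline warning as it appears in this thread.
# It is inferred from the quoted log lines, not taken from the HBase source.
STUCK_RE = re.compile(
    r"state=\\?\{(?P<region>[0-9a-f]{32}) state=(?P<state>\w+),\s*"
    r"ts=\d+,\s*server=(?P<server>[^}\s]+)\};\s*"
    r"ServerCrashProcedures=(?P<scp>true|false)"
)


def parse_stuck_region(log_text):
    """Return region, state, server and SCP flag from the warning, or None."""
    # The mail wraps one log message over several "> " quoted lines; undo that.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    match = STUCK_RE.search(flat)
    if match is None:
        return None
    info = match.groupdict()
    info["scp"] = info["scp"] == "true"
    return info


def hbck2_assign_command(info, jar="/path/to/hbase-hbck2.jar"):
    """Build (not run) an HBCK2 `assigns` command for the parsed region.

    The jar path is a placeholder; point it at your own HBCK2 build.
    """
    return f"hbase hbck -j {jar} assigns {info['region']}"


if __name__ == "__main__":
    sample = r"""
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
"""
    info = parse_stuck_region(sample)
    # A region reported OPEN on a server with no ServerCrashProcedure pending
    # matches the stuck state discussed in this issue.
    if info and info["state"] == "OPEN" and not info["scp"]:
        print(hbck2_assign_command(info))
```

The command is only printed so that an operator can review it before running it against a live master.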
> the initialization of master could not be completed when the hbase:namespace
> region is not online
> ------------------------------------------------------------------------------------------------
>
> Key: HBASE-26287
> URL: https://issues.apache.org/jira/browse/HBASE-26287
> Project: HBase
> Issue Type: Improvement
> Components: master
> Affects Versions: 2.3.5
> Reporter: bolao
> Priority: Major
>
> After an HBase cluster shuts down unexpectedly and is restarted, we sometimes
> find that the master cannot initialize because it is stuck in the
> isRegionOnline method for hbase:namespace. From the master logs we found that
> the master and the meta table consider the hbase:namespace region online, but
> its regionserver is dead. isRegionOnline logs this once a minute and does
> nothing else. I think we could remove the record from the AssignmentManager's
> RegionState and assign hbase:namespace to another regionserver, so that the
> HBase cluster recovers without human intervention. I would like your advice:
> what do you think?
> {panel:title=the logs of master}
> 2021-09-02 18:32:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:33:01 [master/fx-hd-sc-hbase-master-0:16000.Chore.1] INFO
> org.apache.hadoop.hbase.ChoreService.scheduleChore(157) -Chore ScheduledChore
> name=fx-hd-sc-hbase-master-0.fx-hd-sc.fx-ns.svc.cluster.xjht,16000,1630401705440-ClusterStatusChore,
> period=60000, unit=MILLISECONDS is enabled.
> 2021-09-02 18:33:01 [master/fx-hd-sc-hbase-master-0:16000.Chore.1] INFO
> org.apache.hadoop.hbase.ScheduledChore.run(172) -Chore:
> fx-hd-sc-hbase-master-0.fx-hd-sc.fx-ns.svc.cluster.xjht,16000,1630401705440-ClusterStatusChore
> missed its start time
> 2021-09-02 18:33:41 [ProcExecTimeout] INFO
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
> -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown
> servers
> 2021-09-02 18:33:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:34:31 [qtp780802740-4192] INFO http.requests.master.write(60)
> -15.22.70.168 - - [02/Sep/2021:10:34:31 +0000] "GET
> //15.22.70.168:1601/master-status HTTP/1.1" 200 54124
> 2021-09-02 18:34:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:34:51 [qtp780802740-4202] INFO http.requests.master.write(60)
> -15.22.70.168 - - [02/Sep/2021:10:34:51 +0000] "GET
> //15.22.70.168:1601/master-status HTTP/1.1" 200 54122
> 2021-09-02 18:35:41 [ProcExecTimeout] INFO
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
> -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown
> servers
> 2021-09-02 18:35:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:36:20 [qtp780802740-4192] INFO http.requests.master.write(60)
> -15.22.70.168 - - [02/Sep/2021:10:36:20 +0000] "GET
> //15.22.70.168:1601/master-status HTTP/1.1" 200 54122
> 2021-09-02 18:36:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:36:57 [RSProcedureDispatcher-pool4-t23] WARN
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.scheduleForRetry(323)
> -request to
> fx-hd-sc-hbase-slave-15.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1630405458162
> failed due to org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to
> fx-hd-sc-hbase-slave-15.fx-hd-sc.fx-ns.svc.cluster.xjht/172.49.9.38:16020
> failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
> Call[id=6192,methodName=ExecuteProcedures], waitTime=600008,
> rpcTimeout=600000, try=7, retrying...
> 2021-09-02 18:37:42 [ProcExecTimeout] INFO
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
> -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown
> servers
> 2021-09-02 18:37:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:38:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
> 2021-09-02 18:38:49 [zk-event-processor-pool1-t1] INFO
> org.apache.hadoop.hbase.security.token.ZKSecretWatcher.nodeDeleted(94) -Node
> deleted id=168
> 2021-09-02 18:39:42 [ProcExecTimeout] INFO
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
> -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown
> servers
> 2021-09-02 18:39:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster]
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229)
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN,
> ts=1630577738198,
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
> ServerCrashProcedures=false. Master startup cannot progress, in
> holding-pattern until region onlined.
>
> {panel:title=the code of master}
> https://github.com/apache/hbase/blob/fd3fdc08d1cd43eb3432a1a70d31c3aece6ecabe/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1214
> {panel}
>
> {panel}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)