[
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761243#comment-16761243
]
stack commented on HBASE-21844:
-------------------------------
bq. We are basically running master
Oh. Ok. You are aware that Master is not in any release. You are the first to
try it (master is generally dumping ground until some poor RM soul shows up to
clean up the mess and make a release). Can you come back to a 2.1 if you want
most stable or a 2.2 if you want to help out w/ next release?
Starting Master with a force-recover flag is an idea that has been raised in the
past. We've usually cast it as a sole (minimal-)master that comes up, against
which you'd run hbck2 instructions. I like your suggestion of a self-repairing
Master, which goes beyond this. I was thinking we had to go through the
minimal-master stage first, building out the hbck2 vocabulary....
Thanks [~sershe]
> Master could get stuck in initializing state while waiting for meta
> -------------------------------------------------------------------
>
> Key: HBASE-21844
> URL: https://issues.apache.org/jira/browse/HBASE-21844
> Project: HBase
> Issue Type: Bug
> Components: master, meta
> Affects Versions: 3.0.0
> Reporter: Bahram Chehrazy
> Assignee: Bahram Chehrazy
> Priority: Major
> Attachments:
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight
> chance of the master getting into a state where ZK says meta is OPEN, but the
> server is dead and there is no active SCP to recover it (perhaps the SCP has
> aborted and the procWALs were corrupted). In this case waitForMetaOnline never
> returns.
>
> We've seen this happen a few times when there had been a temporary HDFS
> outage. The following log lines show this state.
>
> 2019-01-17 18:55:48,497 WARN [master/************:16000:becomeActiveMaster]
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state=
> {1588230740 *state=*OPEN**, ts=1547780128227,
> server=*************,16020,1547776821322}
> ; *ServerCrashProcedures=false*. Master startup cannot progress, in
> holding-pattern until region onlined.
>
> I'm still investigating why and how to prevent getting into this bad state,
> but nevertheless the master should be able to recover during a restart by
> initiating a new SCP to fix the meta.
>
>
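For illustration only, the recovery check described in the issue could be
sketched roughly as below. This is not the attached patch and not HBase code:
MetaLocation, MetaRecoveryCheck, Action and decide are hypothetical names, and
the inputs (the meta state and server recorded in ZK, the set of live region
servers, whether an SCP is in progress) are assumed to have been gathered by
the caller.

```java
import java.util.Set;

/** Hypothetical snapshot of what ZK records for hbase:meta; not an HBase class. */
final class MetaLocation {
    final String state;      // e.g. "OPEN", "OPENING", "CLOSED"
    final String serverName; // e.g. "host,16020,1547776821322"

    MetaLocation(String state, String serverName) {
        this.state = state;
        this.serverName = serverName;
    }
}

/** Hypothetical decision helper for a master waiting on meta during startup. */
final class MetaRecoveryCheck {
    enum Action { NONE, WAIT, SCHEDULE_SCP }

    static Action decide(MetaLocation meta, Set<String> liveServers, boolean scpInProgress) {
        boolean hostAlive = liveServers.contains(meta.serverName);
        if ("OPEN".equals(meta.state) && hostAlive) {
            return Action.NONE;         // meta really is online, nothing to repair
        }
        if (scpInProgress) {
            return Action.WAIT;         // an existing SCP is expected to recover meta
        }
        if (!hostAlive) {
            return Action.SCHEDULE_SCP; // hosting server is dead and nobody is recovering it
        }
        return Action.WAIT;             // host is alive but meta is still in transition
    }
}
```

With the state from the log above (state=OPEN, hosting server not among the
live servers, ServerCrashProcedures=false), decide returns SCHEDULE_SCP rather
than waiting indefinitely.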