[
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763906#comment-16763906
]
Bahram Chehrazy commented on HBASE-21844:
-----------------------------------------
Digging further, I found the master reporting several orphan and corrupt
procedures during startup.
2019-02-08 04:07:51,123 ERROR [master/*************:16000:becomeActiveMaster]
wal.WALProcedureTree: *Orphan procedure*: Procedure(pid=44283, ppid=44148,
class=org.apache.hadoop.hbase.master.assignment.*OpenRegionProcedure*)
2019-02-08 04:07:52,636 ERROR [master/*************:16000:becomeActiveMaster]
procedure2.ProcedureExecutor: *Corrupt* pid=44745, ppid=44143, state=SUCCESS,
hasLock=false; org.apache.hadoop.hbase.master.assignment.*OpenRegionProcedure*
The master can't resume those procedures, but it still keeps the corresponding
servers in the deadServer. This prevents them from being expired later on,
because the master assumes crash processing is already in progress. See the
log below and check the function ServerManager.findDeadServersAndProcess.
2019-02-08 04:07:53,716 WARN [master/***************:16000:becomeActiveMaster]
master.ServerManager: Expiration called on *<server1>,16020,1549480448950* but
crash processing already in progress
*This prevents the master from restarting the procedure, so it waits forever.*
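
To make the failure mode above concrete, here is a minimal Java sketch. It is a
toy model under stated assumptions, not HBase code: the class DeadServerSketch
and all of its methods are hypothetical and do not mirror the real
ServerManager or DeadServer API. It only shows how a server that is recorded as
"being processed" without a resumable procedure blocks any later expiration,
and one possible way to let a restart recover by scheduling a fresh procedure.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the stuck state. Hypothetical names; NOT HBase's ServerManager. */
final class DeadServerSketch {
    /** Servers the master believes are already under crash processing. */
    private final Set<String> processing = new HashSet<>();
    /** Servers that actually have a crash procedure the master can run. */
    private final Set<String> runnable = new HashSet<>();

    /** Simulates replaying the procedure store at startup (assumed behaviour). */
    public void loadCrashProcedure(String server, boolean corrupt) {
        processing.add(server);      // the server is marked as processed either way
        if (!corrupt) {
            runnable.add(server);    // only an intact procedure can be resumed
        }
    }

    /** Returns true if a new crash procedure was scheduled for the server. */
    public boolean expireServer(String server) {
        if (processing.contains(server)) {
            // Corresponds to "crash processing already in progress" in the log.
            return false;
        }
        processing.add(server);
        runnable.add(server);
        return true;
    }

    /** True when the server is marked as processed but nothing can process it. */
    public boolean isStuck(String server) {
        return processing.contains(server) && !runnable.contains(server);
    }

    /** Possible recovery (an assumption): clear the stale mark, then expire. */
    public boolean expireServerWithRecovery(String server) {
        if (isStuck(server)) {
            processing.remove(server);
        }
        return expireServer(server);
    }

    public static void main(String[] args) {
        DeadServerSketch master = new DeadServerSketch();
        master.loadCrashProcedure("server1", true);   // corrupt procedure at startup
        System.out.println("expired: " + master.expireServer("server1"));
        System.out.println("stuck: " + master.isStuck("server1"));
        System.out.println("recovered: " + master.expireServerWithRecovery("server1"));
    }
}
```

In this model the plain expireServer call is rejected forever, which matches the
warning in the log, while the recovery variant first drops the stale entry so a
new procedure can be scheduled. Whether the real fix should work this way is an
open question.
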
> Master could get stuck in initializing state while waiting for meta
> -------------------------------------------------------------------
>
> Key: HBASE-21844
> URL: https://issues.apache.org/jira/browse/HBASE-21844
> Project: HBase
> Issue Type: Bug
> Components: master, meta
> Affects Versions: 3.0.0
> Reporter: Bahram Chehrazy
> Assignee: Bahram Chehrazy
> Priority: Major
> Attachments:
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after meta server dies, there is a slight chance
> of master getting into a state where the ZK says meta is OPEN, but the server
> is dead and there is no active SCP to recover it (perhaps the SCP has aborted
> and the procWALs were corrupted). In this case the waitForMetaOnline never
> returns.
>
> We've seen this happen a few times after a temporary HDFS outage. The
> following log lines show this state.
>
> 2019-01-17 18:55:48,497 WARN [master/************:16000:becomeActiveMaster]
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state=
> {1588230740 *state=*OPEN**, ts=1547780128227,
> server=*************,16020,1547776821322}
> ; *ServerCrashProcedures=false*. Master startup cannot progress, in
> holding-pattern until region onlined.
>
> I'm still investigating why and how to prevent getting into this bad state,
> but nevertheless the master should be able to recover during a restart by
> initiating a new SCP to fix the meta.
>
>