[
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763853#comment-16763853
]
Bahram Chehrazy edited comment on HBASE-21844 at 2/8/19 7:50 PM:
-----------------------------------------------------------------
I have a new repro of this, except this time there is a pending SCP.
Obviously this patch did not mitigate it, but the added logs show that the
master knows the server is dead, yet it does nothing except wait for the
procedure to complete. I suggest adding a timeout to the loop and moving
meta to another server if the timeout expires.
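A minimal sketch of the suggested timeout, for illustration only. It is not
the attached patch and does not use HBase's API: MetaWaitSketch,
waitForMetaOrReassign, isMetaOnline and reassignMeta are hypothetical names,
and the caller is assumed to supply the "is meta online" check and the
"move meta to another server" action.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names come from HBase. */
final class MetaWaitSketch {

    enum Outcome { META_ONLINE, TIMED_OUT_REASSIGN, INTERRUPTED }

    /**
     * Polls isMetaOnline until it returns true or the timeout elapses.
     * On timeout, runs reassignMeta once instead of waiting forever.
     */
    static Outcome waitForMetaOrReassign(BooleanSupplier isMetaOnline,
                                         Runnable reassignMeta,
                                         Duration timeout,
                                         Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!isMetaOnline.getAsBoolean()) {
            // Overflow-safe comparison of monotonic clock readings.
            if (System.nanoTime() - deadline >= 0) {
                reassignMeta.run();
                return Outcome.TIMED_OUT_REASSIGN;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                return Outcome.INTERRUPTED;
            }
        }
        return Outcome.META_ONLINE;
    }
}
```

The point of the sketch is only that the wait loop gets a bounded exit: once
the deadline passes, the fallback action runs instead of another poll.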
2019-02-08 04:07:53,716 WARN [master/BN01APA827CE794:16000:becomeActiveMaster]
master.ServerManager: Expiration called on bn01ap6e03f8370,16020,1549480448950
but crash processing already in progress
2019-02-08 04:07:59,851 INFO [master/****************:16000:becomeActiveMaster]
mortbay.log: Meta state is OPEN: {1588230740 state=*OPEN*, ts=1549627673541,
server=*<server1>,16020,1549480448950*}
2019-02-08 04:07:59,851 WARN [master/****************:16000:becomeActiveMaster]
master.HMaster: Region hbase:meta,,1.1588230740 state is OPEN, but the server
*<server1>,16020,1549480448950* has crashed. Waiting for SCP to recover it.
2019-02-08 04:07:59,934 WARN [master/***************:16000:becomeActiveMaster]
master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740
state=OPEN, ts=1549627673541, server=*<server1>,16020,1549480448950*};
ServerCrashProcedures=*true*. Master startup cannot progress, in
holding-pattern until region onlined.
201
was (Author: bahramch):
I have a new repro of this, except this time there is a pending SCP. But that
SCP may not even be for the meta region. Obviously this patch did not mitigate
it, but the added logs show that the master knows the server is dead, yet
it does nothing except wait for the procedure to complete. I
suggest adding a timeout to the loop and moving meta to another server if
the timeout expires.
2019-02-08 08:41:02,811 INFO [master/****************:16000:becomeActiveMaster]
mortbay.log: Meta state is OPEN: {1588230740 state=*OPEN*, ts=1549627673541,
server=*<server1>,16020,1549480448950*}
2019-02-08 08:41:02,811 WARN [master/****************:16000:becomeActiveMaster]
master.HMaster: Region hbase:meta,,1.1588230740 state is OPEN, but the server
*<server1>,16020,1549480448950* has crashed. Waiting for SCP to recover it.
2019-02-08 08:41:02,811 WARN [master/***************:16000:becomeActiveMaster]
master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740
state=OPEN, ts=1549627673541, server=*<server1>,16020,1549480448950*};
ServerCrashProcedures=*true*. Master startup cannot progress, in
holding-pattern until region onlined.
201
> Master could get stuck in initializing state while waiting for meta
> -------------------------------------------------------------------
>
> Key: HBASE-21844
> URL: https://issues.apache.org/jira/browse/HBASE-21844
> Project: HBase
> Issue Type: Bug
> Components: master, meta
> Affects Versions: 3.0.0
> Reporter: Bahram Chehrazy
> Assignee: Bahram Chehrazy
> Priority: Major
> Attachments:
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight
> chance of the master getting into a state where ZK says meta is OPEN, but the
> server is dead and there is no active SCP to recover it (perhaps the SCP has
> aborted and the procWALs were corrupted). In this case, waitForMetaOnline
> never returns.
>
> We've seen this happen a few times when there was a temporary HDFS
> outage. The following log lines show this state.
>
> 2019-01-17 18:55:48,497 WARN [master/************:16000:becomeActiveMaster]
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740
> *state=OPEN*, ts=1547780128227, server=*************,16020,1547776821322};
> *ServerCrashProcedures=false*. Master startup cannot progress, in
> holding-pattern until region onlined.
>
> I'm still investigating why and how to prevent getting into this bad state,
> but nevertheless the master should be able to recover during a restart by
> initiating a new SCP to fix the meta.
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)