[jira] [Commented] (HBASE-21844) Master could get stuck in initializing state while waiting for meta

stack (JIRA) Mon, 04 Feb 2019 22:57:36 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760514#comment-16760514
 ]


stack commented on HBASE-21844:
-------------------------------

On the patch:

On the rename of the method: it could go either way, but since the result is a 
boolean, the isRegion... form seems more appropriate than wait... (the fact 
that it blocks unless there is an error is noted in the javadoc). This is a nit.

The added logging does no harm (it is a benefit, actually), though won't this 
line get spewed a bunch?

    LOG.warn("{} state is OPEN, but the server {} is dead. Waiting for SCP to recover it.",
        ri.getRegionNameAsString(), rs.getServerName());

On this...

    LOG.error("{} State is OPEN, but the server {} is not online and no SCP is scheduled. Expiring the server.",
        ri.getRegionNameAsString(), rs.getServerName());
    this.getServerManager().expireServer(rs.getServerName());

... could we already be processing the dead server at this point?  You could 
check before expiring it.
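
To make the suggested check concrete, here is a minimal sketch of the 
three-way decision. It does not use the real HBase classes: ClusterView, its 
three sets, and the Action enum are hypothetical stand-ins for whatever the 
master actually tracks (online servers, servers with an SCP, dead servers 
already being processed), so treat it as an illustration of the ordering of 
the checks rather than as patch code.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch only: hypothetical types, not HBase's ServerManager or DeadServer API. */
final class MetaRecoverySketch {

  /** What the master would do about a region that is recorded as OPEN. */
  enum Action { WAIT_FOR_REGION, WAIT_FOR_SCP, EXPIRE_SERVER }

  /** Assumed, simplified view of cluster state, keyed by server name. */
  static final class ClusterView {
    final Set<String> onlineServers = new HashSet<>();
    final Set<String> serversWithScp = new HashSet<>();
    final Set<String> deadServersBeingProcessed = new HashSet<>();
  }

  /** Expire the server only when nothing else is going to recover it. */
  static Action decide(ClusterView view, String server) {
    if (view.onlineServers.contains(server)) {
      // Hosting server is alive; the region should come online by itself.
      return Action.WAIT_FOR_REGION;
    }
    if (view.serversWithScp.contains(server)
        || view.deadServersBeingProcessed.contains(server)) {
      // Recovery is already in flight; expiring again would be redundant.
      return Action.WAIT_FOR_SCP;
    }
    // Server is gone and nobody is handling it: the stuck case from the issue.
    return Action.EXPIRE_SERVER;
  }
}
```

With this ordering, the expire call is reached only when the server is 
neither online nor already being handled, which is the case the patch is 
trying to cover.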

Yeah, if there is no SCP for this server, then the above would help, but I'm 
more interested in why no SCP was scheduled. That seems like the real issue. 
If we are failing to schedule an SCP, or dropping one around startup, we 
should try to fix that.

Thank you.

> Master could get stuck in initializing state while waiting for meta
> -------------------------------------------------------------------
>
>                 Key: HBASE-21844
>                 URL: https://issues.apache.org/jira/browse/HBASE-21844
>             Project: HBase
>          Issue Type: Bug
>          Components: master, meta
>    Affects Versions: 3.0.0
>            Reporter: Bahram Chehrazy
>            Assignee: Bahram Chehrazy
>            Priority: Major
>         Attachments: 
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight 
> chance of the master getting into a state where ZK says meta is OPEN, but 
> the server is dead and there is no active SCP to recover it (perhaps the SCP 
> has aborted and the procWALs were corrupted). In this case 
> waitForMetaOnline never returns.
>  
> We've seen this happen a few times after a temporary HDFS outage. The 
> following log line shows this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/************:16000:becomeActiveMaster] 
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; 
> state={1588230740 state=OPEN, ts=1547780128227, 
> server=*************,16020,1547776821322}; ServerCrashProcedures=false. 
> Master startup cannot progress, in holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, 
> but nevertheless the master should be able to recover during a restart by 
> initiating a new SCP to fix the meta.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
