[
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608254#comment-16608254
]
Duo Zhang commented on HBASE-21035:
-----------------------------------
If the cluster is not in a good state and several RSes keep crashing, the
master will hang forever if it has to wait for all SCPs to finish. And how do
you determine programmatically that meta is on a stale server?
My concern is that there will be races: a server crash can happen at any
time, so including this logic in the startup code path will make the code
flaky.
We can have a tool that does something like this, but when to use it should be
decided by a human. Maybe we could do more checks and print something in the
log saying that the state seems incorrect, please check XXX and XXX to see if
something is wrong, and try the XXX tool?
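For illustration only, the "more checks and print something in the log" idea
could look roughly like the sketch below. Everything in it is hypothetical:
MetaStateCheck, its check method, and its three inputs are made-up names, not
HBase APIs, and the message keeps the tool name as a placeholder. It only
reports and never acts, so the decision stays with the operator.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical startup diagnostic; none of these names are HBase APIs. */
public final class MetaStateCheck {

  private MetaStateCheck() {}

  /**
   * Returns a warning for the operator if meta looks stranded, otherwise empty.
   *
   * @param metaServer     server name recorded as hosting meta (assumed input)
   * @param liveServers    server names that currently have a ZooKeeper RS node
   * @param serversWithScp server names that already have a ServerCrashProcedure
   */
  public static Optional<String> check(
      String metaServer, Set<String> liveServers, Set<String> serversWithScp) {
    // Meta is on a live server: nothing to report.
    if (liveServers.contains(metaServer)) {
      return Optional.empty();
    }
    // The server is dead, but an SCP exists and is expected to reassign meta.
    if (serversWithScp.contains(metaServer)) {
      return Optional.empty();
    }
    // Dead server and no SCP: nothing will bring meta online by itself.
    return Optional.of(
        "hbase:meta is recorded on " + metaServer
            + ", which has no live ZooKeeper node and no ServerCrashProcedure."
            + " The state seems incorrect; please check it and consider the XXX tool.");
  }
}
```

Because a server can crash at any time, the result is only a snapshot of the
inputs passed in; it is meant to be logged, not acted on automatically.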
> Meta Table should be able to online even if all procedures are lost
> -------------------------------------------------------------------
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.1.0
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way the master initializes after it
> starts. It now only checks the WAL dirs and compares them with the ZooKeeper
> RS nodes to decide which servers need to be expired. For servers whose dir
> ends with 'SPLITTING', we assume that there will be an SCP for them.
> But if the server hosting the meta region crashed before the master
> restarts, and all the procedure WALs are lost (due to a bug, manual
> deletion, or whatever), the newly restarted master will be stuck during
> initialization, since nothing will bring the meta region online.
> Although this is an anomalous case, I think that no matter what happens, we
> need to bring the meta region online. Otherwise we are sitting ducks and
> nothing can be done.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)