[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613263#comment-16613263
 ] 

Duo Zhang commented on HBASE-21035:
-----------------------------------

For InitMetaProcedure, though, this can happen in the normal case: the master 
may crash after storing the InitMetaProcedure but before actually executing 
it. For SCP, it cannot happen in the normal case. This is the main difference. 
We cannot program for unknown issues. Removing all the procedure WALs is only 
one possible way to produce this problem, and you do not know whether it can 
also be caused by other problems. Instead, we should provide tools that let 
operators recover from the disaster manually.

We have the same concern for HBase 2.x: HBCK1 is broken and cannot be used, 
but HBCK2 is still not available. If bugs in the code cause critical problems 
for a cluster, we have no way to get the cluster back. It is not a good idea 
to tell users 'if a procedure is stuck, then please just clean up your cluster 
and set up a new one'...

So let's start helping on HBCK2? 

> Meta Table should be able to online even if all procedures are lost
> -------------------------------------------------------------------
>
>                 Key: HBASE-21035
>                 URL: https://issues.apache.org/jira/browse/HBASE-21035
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>         Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way the master initializes after it 
> starts. It now only checks the WAL dirs and compares them to the ZooKeeper 
> RS nodes to decide which servers need to be expired. For servers whose dir 
> ends with 'SPLITTING', we ensure that there will be an SCP for it.
> But if the server hosting the meta region crashed before the master 
> restarted, and all the procedure WALs are lost (due to a bug, manual 
> deletion, or whatever), the newly restarted master will be stuck while 
> initializing, since no one will bring the meta region online.
> Although this is an anomalous case, I think that no matter what happens, we 
> need to bring the meta region online. Otherwise we are sitting ducks, and 
> nothing can be done.
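
The startup check described in the issue can be sketched roughly as follows. 
This is only an illustration of the comparison between WAL directories and 
live ZooKeeper RS nodes, not the actual HMaster code: the class and method 
names are hypothetical, the "-splitting" suffix and the "host,port,startcode" 
server names are assumptions, and plain collections stand in for the file 
system and ZooKeeper.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these names come from the HBase code base.
final class StartupExpirySketch {
    // Assumed suffix of a WAL directory whose logs are being split.
    static final String SPLITTING_SUFFIX = "-splitting";

    // Returns the servers that should be expired (get an SCP) at startup:
    // any server whose WAL dir is marked as splitting, plus any server that
    // has a WAL dir but no live region server node in ZooKeeper.
    static Set<String> serversToExpire(Collection<String> walDirNames,
                                       Set<String> liveServers) {
        Set<String> expire = new TreeSet<>();
        for (String dir : walDirNames) {
            boolean splitting = dir.endsWith(SPLITTING_SUFFIX);
            // Strip the suffix to recover the server name.
            String server = splitting
                ? dir.substring(0, dir.length() - SPLITTING_SUFFIX.length())
                : dir;
            if (splitting || !liveServers.contains(server)) {
                expire.add(server);
            }
        }
        return expire;
    }

    public static void main(String[] args) {
        List<String> walDirs = List.of(
            "rs1,16020,1", "rs2,16020,1-splitting", "rs3,16020,1");
        Set<String> live = Set.of("rs1,16020,1");
        // rs2 is mid-split and rs3 has no live node, so both are expired.
        System.out.println(serversToExpire(walDirs, live));
    }
}
```

Note that this check only decides which servers get an SCP. It says nothing 
about procedures that were already stored and then lost, which is the case 
the issue is concerned with.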



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
