Allan Yang commented on HBASE-21035:

You bring meta online and mess up all the data and cause unrecoverable data 
I can't think of any condition under which bringing meta online would cause 
data loss or any other unrecoverable situation…

In HBase-1.x, if something goes wrong with the AssignmentManager, restarting the 
master fixes the problem in most cases. In a catastrophic scenario, we can 
even delete all the RIT znodes and assign the regions again. As long as HDFS and 
ZK are healthy and the RSes work normally, the meta region can at least come 
online no matter

But this is not the case with AMv2. All the states and procedures are 
persisted, so restarting the master restores the same state as before the 
restart (we are trying hard to ensure that...). Restarting the master no longer 
helps as it used to, and it is also hard to intervene in procedures. That means 
it is not easy to recover the system if there are any bugs in AMv2 (which is 
very likely...).

In some cases, we do need to delete all procedures to get a clean state for 
recovery, as described in the doc 'Fixing regions stuck in transition in 
HBase 2.0' in HBASE-19121.
But if a clean start still causes the system to hang, it is hard for other 
fix tools like HBCK to kick in.

For me, crash or hang is much much better than doing dangerous operations in 
From my point of view, it is essential that a production-ready system 
can recover without any code changes.

> Meta Table should be able to online even if all procedures are lost
> -------------------------------------------------------------------
>                 Key: HBASE-21035
>                 URL: https://issues.apache.org/jira/browse/HBASE-21035
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>         Attachments: HBASE-21035.branch-2.0.001.patch
> After HBASE-20708, we changed the way the master initializes after it starts. 
> It now only checks the WAL dirs and compares them to the ZooKeeper RS nodes to 
> decide which servers need to expire. For servers whose dir name ends with 
> 'SPLITTING', we assume that there is already an SCP for it.
> But if the server hosting the meta region crashed before the master restarts, 
> and all the procedure WALs are lost (due to a bug, manual deletion, or 
> whatever), the newly restarted master will be stuck during initialization, 
> since no one will bring the meta region online.
> Although this is an anomalous case, I think that no matter what happens, we 
> need to bring the meta region online. Otherwise we are sitting ducks; nothing 
> can be done.
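
The startup check described in the quoted issue can be sketched roughly as 
follows. This is a minimal illustration, not HBase's actual code: the class 
name, the method name, and the use of plain strings for server names and WAL 
dir names are all assumptions made for the example.

```java
import java.util.Set;
import java.util.TreeSet;

class StartupScanSketch {
    // The quoted description only says the dir name ends with 'SPLITTING';
    // the exact suffix used here is an assumption.
    static final String SPLITTING_SUFFIX = "SPLITTING";

    /**
     * Returns the servers that own a WAL dir but have no live ZooKeeper node.
     * Dirs ending in SPLITTING are skipped because a ServerCrashProcedure
     * (SCP) is assumed to exist for them already.
     */
    public static Set<String> serversToExpire(Set<String> walDirs,
                                              Set<String> liveServers) {
        Set<String> toExpire = new TreeSet<>();
        for (String dir : walDirs) {
            if (dir.endsWith(SPLITTING_SUFFIX)) {
                continue; // assumed to be handled by an existing SCP
            }
            // Assumption: a non-splitting WAL dir is named after its server.
            if (!liveServers.contains(dir)) {
                toExpire.add(dir);
            }
        }
        return toExpire;
    }

    public static void main(String[] args) {
        Set<String> walDirs = Set.of("rs1", "rs2", "rs3-SPLITTING");
        Set<String> liveServers = Set.of("rs1");
        System.out.println(serversToExpire(walDirs, liveServers));
    }
}
```

With these inputs only rs2 is expired. rs3 is skipped on the assumption that an 
SCP already exists for it; if rs3 hosted meta and the procedure WALs are gone, 
nothing in this scan would bring meta back online, which is the situation the 
issue describes.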

This message was sent by Atlassian JIRA