Sergey Shelukhin created HBASE-21797:
----------------------------------------

             Summary: more resilient master startup for bad cluster state
                 Key: HBASE-21797
                 URL: https://issues.apache.org/jira/browse/HBASE-21797
             Project: HBase
          Issue Type: Bug
            Reporter: Sergey Shelukhin


See HBASE-21743 for broader context.
On restart after a failure, the master should already be able to handle having 
failed to persist the state of some procedures (by definition, the cluster is 
much more likely to be in a bad state if the master restarted due to some 
issue). It should therefore also be able to abandon old recovery procedures 
(SCP and RIT procedures and their children) as if they had never been saved, 
and create new ones during startup.

This should be off by default.

The idea is (some steps can be done in parallel as they are now, e.g. loading 
server list and meta): 
1) During proc WAL recovery do not recover SCP and open/close related procs.
2) Load server list as usual (dead and alive).
3) Recover meta, if it's on a dead server, via either a new SCP or perhaps 
just a separate meta recovery proc without extra SCP steps (leaving the SCP 
for step 5).
4) Load region list as usual.
5) Create SCPs for dead servers.
6) Reassign any regions on non-existent servers (we've seen issues with this 
when an SCP finishes but there were lots of HDFS errors and/or manual 
intervention, so the master "forgets" the server ever existed and the region 
stays "open" there forever).
7) ? Look for other simple inconsistencies that don't require HBCK-level 
changes.
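
The steps above can be sketched roughly as follows. This is a minimal, 
self-contained Java model and not HBase code: ResilientStartupSketch, ProcKind, 
filterRecovered, regionsToReassign and the abandonRecoveryProcs flag are 
hypothetical names made up for illustration, and servers and regions are plain 
strings. It only covers step 1 (skipping SCP and open/close procs when the 
feature is on) and step 6 (finding regions whose recorded server is neither 
alive nor dead).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; none of these types are HBase classes.
class ResilientStartupSketch {

  // Hypothetical procedure kinds as they might be read back from the proc WAL.
  enum ProcKind { SERVER_CRASH, REGION_OPEN, REGION_CLOSE, OTHER }

  // Step 1: when the (hypothetical) flag is on, drop SCP and open/close procs
  // during recovery; when it is off, recover everything as today.
  static List<ProcKind> filterRecovered(List<ProcKind> fromWal,
      boolean abandonRecoveryProcs) {
    List<ProcKind> kept = new ArrayList<>();
    for (ProcKind p : fromWal) {
      if (!abandonRecoveryProcs || p == ProcKind.OTHER) {
        kept.add(p);
      }
    }
    return kept;
  }

  // Step 6: regions whose recorded server is neither alive nor dead, i.e. the
  // master no longer knows the server. Regions on dead servers are left to the
  // SCPs created in step 5.
  static List<String> regionsToReassign(Map<String, String> regionToServer,
      Set<String> aliveServers, Set<String> deadServers) {
    List<String> orphaned = new ArrayList<>();
    for (Map.Entry<String, String> e : regionToServer.entrySet()) {
      String server = e.getValue();
      if (!aliveServers.contains(server) && !deadServers.contains(server)) {
        orphaned.add(e.getKey());
      }
    }
    return orphaned;
  }

  public static void main(String[] args) {
    List<ProcKind> wal = Arrays.asList(
        ProcKind.SERVER_CRASH, ProcKind.OTHER, ProcKind.REGION_OPEN);
    System.out.println("recovered procs: " + filterRecovered(wal, true));

    Set<String> alive = new HashSet<>(Arrays.asList("rs1"));
    Set<String> dead = new HashSet<>(Arrays.asList("rs2"));
    Map<String, String> regions = new LinkedHashMap<>();
    regions.put("r1", "rs1"); // alive server: nothing to do
    regions.put("r2", "rs2"); // dead server: handled by a new SCP (step 5)
    regions.put("r3", "rs9"); // unknown server: must be reassigned (step 6)
    System.out.println("new SCPs for: " + dead);
    System.out.println("reassign: " + regionsToReassign(regions, alive, dead));
  }
}
```

Keeping the filter behind a boolean mirrors the "off by default" requirement: 
with the flag off, the sketch recovers every procedure unchanged.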



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
