Sergey Shelukhin created HBASE-21797:
----------------------------------------
Summary: more resilient master startup for bad cluster state
Key: HBASE-21797
URL: https://issues.apache.org/jira/browse/HBASE-21797
Project: HBase
Issue Type: Bug
Reporter: Sergey Shelukhin
See HBASE-21743 for broader context.
After a failure, the master on restart should already be able to handle having
failed to persist the state of some procedures (by definition, the cluster is
much more likely to be in a bad state if the master restarted due to some
issue). It should therefore also be able to abandon old recovery procedures
(SCP & RIT and their children) as if they had never been saved, and create new
ones during startup.
This should be off by default.
The idea is (some steps can be done in parallel as they are now, e.g. loading
server list and meta):
1) During proc WAL recovery do not recover SCP and open/close related procs.
2) Load server list as usual (dead and alive).
3) If meta is on a dead server, recover it via either a new SCP or perhaps just
a separate meta recovery proc without the extra SCP steps, leaving the SCP
itself for step 5.
4) Load region list as usual.
5) Create SCPs for dead servers.
6) Reassign any regions on non-existent servers (we've seen issues with this
when an SCP finishes amid lots of HDFS errors and/or manual intervention: the
master "forgets" the server ever existed and the region stays "open" there
forever).
7) ? Look for other simple inconsistencies that don't require HBCK-level
changes.
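As a rough illustration of steps 1, 5 and 6 above, here is a minimal Java
sketch. It is not HBase code: the class ResilientStartupSketch, the ProcType
enum, the abandonRecoveryProcs flag and all method names are hypothetical and
exist only for this example. Servers and regions are modeled as plain strings,
parent/child procedure relationships are ignored, and meta recovery (step 3)
is not modeled at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the proposed startup flow; none of these are HBase types.
public class ResilientStartupSketch {

    // Simplified procedure kinds as they might appear when replaying the proc WAL.
    public enum ProcType { SERVER_CRASH, REGION_OPEN, REGION_CLOSE, CREATE_TABLE }

    // Step 1: when the (assumed) feature flag is on, drop SCP and open/close
    // related procedures during recovery; everything else is kept.
    public static List<ProcType> filterRecovered(List<ProcType> fromWal,
                                                 boolean abandonRecoveryProcs) {
        if (!abandonRecoveryProcs) {
            return new ArrayList<>(fromWal); // off by default: recover everything
        }
        List<ProcType> kept = new ArrayList<>();
        for (ProcType p : fromWal) {
            boolean recoveryProc = p == ProcType.SERVER_CRASH
                    || p == ProcType.REGION_OPEN
                    || p == ProcType.REGION_CLOSE;
            if (!recoveryProc) {
                kept.add(p);
            }
        }
        return kept;
    }

    // Step 5: every dead server gets a fresh SCP; sorted for stable output.
    public static List<String> scpTargets(Set<String> deadServers) {
        List<String> targets = new ArrayList<>(deadServers);
        Collections.sort(targets);
        return targets;
    }

    // Step 6: regions whose recorded server is neither live nor dead are on a
    // server the master no longer knows about and need reassignment.
    public static List<String> regionsOnUnknownServers(Map<String, String> regionToServer,
                                                       Set<String> liveServers,
                                                       Set<String> deadServers) {
        List<String> orphaned = new ArrayList<>();
        for (Map.Entry<String, String> e : regionToServer.entrySet()) {
            String server = e.getValue();
            if (!liveServers.contains(server) && !deadServers.contains(server)) {
                orphaned.add(e.getKey());
            }
        }
        Collections.sort(orphaned);
        return orphaned;
    }

    public static void main(String[] args) {
        List<ProcType> wal = Arrays.asList(
                ProcType.SERVER_CRASH, ProcType.CREATE_TABLE, ProcType.REGION_OPEN);
        Set<String> live = new HashSet<>(Arrays.asList("rs1"));
        Set<String> dead = new HashSet<>(Arrays.asList("rs2"));
        Map<String, String> regions = new HashMap<>();
        regions.put("regionA", "rs1");
        regions.put("regionB", "rs2");
        regions.put("regionC", "rs9"); // rs9 is unknown to the master

        System.out.println("recovered procs: " + filterRecovered(wal, true));
        System.out.println("new SCPs for:    " + scpTargets(dead));
        System.out.println("reassign:        " + regionsOnUnknownServers(regions, live, dead));
    }
}
```

In this sketch, regions on a dead server (regionB) are left to the new SCP
created in step 5, and only regions on servers the master does not know about
(regionC) are picked up by step 6.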
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)