Sergey Shelukhin created HBASE-21743:
----------------------------------------
Summary: stateless assignment
Key: HBASE-21743
URL: https://issues.apache.org/jira/browse/HBASE-21743
Project: HBase
Issue Type: Bug
Reporter: Sergey Shelukhin
After running HBase for only a few weeks, we have found a dozen or more bugs
in assignment that all seem to have the same nature: split brain between two
procedures; or between a procedure and master startup (meta replica bugs); or
between a procedure and master shutdown (HBASE-21742); or between a procedure
and something else (when SCP had an incorrect region list persisted; I don't
recall the bug#).
To me, this is starting to look like a pattern. In AMv1, concurrent
interactions were unclear and hard to reason about. AMv2 has cleaner
individual pieces, but the problem of unclear concurrent interactions has been
preserved, and has in fact grown, because of operation state persistence and
isolation.
Procedures are great for multi-step operations that need rollback and the
like, e.g. creating a table or snapshot, or even region splitting. However,
I'm not so sure about assignment.
We already have the persisted information: the region state in meta (including
transition states such as opening or closing) and the server list in the form
of the WAL directory list. Procedure state is not any more reliable than those
(we can argue that a meta update can fail, but so can a procv2 WAL flush, so we
have to handle out-of-date information regardless). So we don't need any extra
state to decide on assignment, whether for recovery or balancing. In fact, as
mentioned in some bugs, deleting the procv2 WAL is often the best way to
recover the cluster, because the master can already figure out what to do
without additional state.
I think there should be an option for stateless assignment that decides what
to do from this persisted state alone.
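
To make the idea concrete, here is a rough sketch of what one stateless pass
could look like, written in Python for brevity. Everything in it is
hypothetical: the State enum, plan_assignments, and the action names are made
up for illustration and are not HBase APIs. It also assumes every region should
be online, so it ignores disabled tables, splits/merges, and balancing.

```python
from enum import Enum


class State(Enum):
    # Hypothetical, simplified subset of the region states kept in meta.
    OPEN = "OPEN"
    OPENING = "OPENING"
    CLOSING = "CLOSING"
    CLOSED = "CLOSED"


def plan_assignments(meta_rows, live_servers):
    """Derive actions purely from persisted state, with no procedure state.

    meta_rows:    dict of region name -> (State, server name or None),
                  standing in for the region state stored in meta.
    live_servers: set of server names, standing in for the server list
                  derived from the WAL directories.
    Returns a dict of region name -> action for regions that need one.
    """
    actions = {}
    for region, (state, server) in sorted(meta_rows.items()):
        if server is None or server not in live_servers:
            # No host, or the recorded host is gone: assign somewhere.
            actions[region] = "assign"
        elif state is State.OPEN:
            # Open on a live server: nothing to do.
            continue
        elif state is State.OPENING:
            # Transition was in flight: re-drive the open on that server.
            actions[region] = "retry_open"
        elif state is State.CLOSING:
            # Transition was in flight: re-drive the close on that server.
            actions[region] = "retry_close"
        else:
            # CLOSED but expected online (assumption of this sketch).
            actions[region] = "assign"
    return actions
```

Because the plan is a pure function of the meta contents and the server list,
running it again after a master restart gives the same answer, which is the
property argued for above.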