[
https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754246#comment-16754246
]
Sergey Shelukhin edited comment on HBASE-21743 at 1/28/19 6:51 PM:
-------------------------------------------------------------------
Ok, I have less time for that now due to having to debug all the issues ;)
However split and merge are not covered by my "smaller" proposal, where master
(if configured) will ignore only recovery-related procedures.
During failure, master should already be able to handle not persisting the
state of some procedure (because by definition cluster is much more likely to
be in a bad state), so it should also be able to abandon old recovery
procedures (SCP & RIT and their children) as if they were not saved, and create
new ones during startup.
I will keep this JIRA for the larger feature (and later move the discussion to
dev@ when there's more time :)), and file a separate JIRA (HBASE-21797) for
the recovery part...
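
To make the "smaller" proposal concrete, here is a minimal sketch of the startup filter it implies: with the option on, persisted recovery procedures and their children are dropped as if they had never been saved, while split and merge procedures are still resumed. This is not HBase code. The type labels, the `PersistedProc` record, and the `ignore_recovery` flag are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical type labels for illustration; these are not HBase class names.
RECOVERY_TYPES = {"SCP", "RIT"}


@dataclass(frozen=True)
class PersistedProc:
    proc_id: int
    proc_type: str                   # e.g. "SCP", "RIT", "SPLIT", "MERGE"
    parent_id: Optional[int] = None  # None for a root procedure


def procs_to_resume(procs, ignore_recovery):
    """Return the persisted procedures the master would still resume."""
    if not ignore_recovery:
        return list(procs)
    by_id = {p.proc_id: p for p in procs}

    def is_recovery(proc):
        # A procedure counts as recovery-related if it, or any ancestor
        # in its parent chain, has a recovery type.
        seen = set()
        while proc is not None and proc.proc_id not in seen:
            if proc.proc_type in RECOVERY_TYPES:
                return True
            seen.add(proc.proc_id)
            proc = by_id.get(proc.parent_id)
        return False

    return [p for p in procs if not is_recovery(p)]
```

In this sketch the master would then create fresh recovery procedures from the current cluster state instead of resuming the dropped ones.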
was (Author: sershe):
Ok, I have less time for that now due to having to debug all the issues ;)
However split and merge are not covered by my "smaller" proposal, where master
(if configured) will ignore only recovery-related procedures.
During failure, master should already be able to handle not persisting the
state of some procedure (because by definition cluster is much more likely to
be in a bad state), so it should also be able to abandon old recovery
procedures (SCP & RIT and their children) as if they were not saved, and create
new ones during startup.
I will keep this JIRA for the larger feature (and later move the discussion to
dev@ when there's more time :)), and file a separate JIRA for the recovery
part...
> stateless assignment
> --------------------
>
> Key: HBASE-21743
> URL: https://issues.apache.org/jira/browse/HBASE-21743
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Major
>
> Running HBase for only a few weeks we found dozen(s?) of bugs with assignment
> that all seem to have the same nature - split brain between 2 procedures; or
> between procedure and master startup (meta replica bugs); or procedure and
> master shutdown (HBASE-21742); or procedure and something else (when SCP had
> incorrect region list persisted, don't recall the bug#).
> To me, this is starting to look like a pattern: as in AMv1, where concurrent
> interactions were unclear and hard to reason about, AMv2 has preserved the
> problem of unclear concurrent interactions despite its cleaner individual
> pieces, and has in fact made it worse because of operation state persistence
> and isolation.
> Procedures are great for multi-step operations that need rollback and stuff
> like that, e.g. creating a table or snapshot, or even region splitting.
> However I'm not so sure about assignment.
> We already have the persisted information: region state in meta (including
> transition states like opening or closing) and the server list as the WAL
> directory list.
> Procedure state is not any more reliable than those (we can argue that meta
> update can fail, but so can procv2 WAL flush, so we have to handle cases of
> out of date information regardless). So, we don't need any extra state to
> decide on assignment, whether for recovery or balancing. In fact, as
> mentioned in some bugs, deleting procv2 WAL is often the best way to recover
> the cluster, because master can already figure out what to do without
> additional state.
> I think there should be an option for stateless assignment that does that.
> It can either be a separate pluggable assignment procedure; or an option
> that will not recover SCP, RITs etc from WAL but always derive recovery
> procedures from the existing cluster state.
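
As a rough illustration of deriving recovery work from existing cluster state alone, the sketch below takes the two persisted sources named in the description (region state in meta, and the server list taken from WAL directories) plus the set of currently live servers. It is not HBase code: the function name, the state strings, and the input shapes are assumptions made for this example, and a real master would need far more careful handling of each case.

```python
# Hypothetical state labels for illustration only.
TRANSITION_STATES = {"OPENING", "CLOSING"}


def derive_recovery(meta_regions, wal_dir_servers, live_servers):
    """Derive recovery work without reading any procedure state.

    meta_regions:    dict of region name -> (state, server or None)
    wal_dir_servers: servers that have a WAL directory
    live_servers:    servers currently alive
    """
    live = set(live_servers)
    # A server with a WAL directory that is not live needs crash handling.
    dead_servers = sorted(set(wal_dir_servers) - live)
    # A region needs attention if it is stuck in a transition state, or if
    # meta says it is open on a server that is no longer live.
    regions = sorted(
        name
        for name, (state, server) in meta_regions.items()
        if state in TRANSITION_STATES
        or (state == "OPEN" and server not in live)
    )
    return dead_servers, regions
```

For example, with regions `r1` open on live server `s1`, `r2` opening on `s2`, and `r3` open on dead server `s3`, the function reports `s3` as needing crash handling and `r2` and `r3` as needing attention.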