[ https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749406#comment-16749406 ]

Sergey Shelukhin commented on HBASE-21743:
------------------------------------------

We've been running a snapshot build of master. Indeed, we found that procv2 
deletion can sometimes lead to additional issues; however, sometimes it's also 
the only way forward. 
[~stack], there's no way to find out about old dead servers on restart other 
than the WAL directories (or inferring from stale region assignments stored in 
meta), because the servers are not stored anywhere else (and the ZK node is 
gone for a dead server, as intended).

The basic idea is to look at the list of regions (in meta) and at the live and 
dead servers - both of which the master already does - and schedule procedures 
from scratch as required, instead of relying on the procedure WAL. 
Personally (as we've discussed years ago ;)) I would prefer something like an 
actor model, where a central fast actor does this in a loop and fires off 
idempotent slow actions asynchronously; but within the current paradigm I think 
reducing state (optionally, i.e. behind a config) would provide some benefit. 
Right now, every bug I file (and all those I don't file, which result from the 
subtly-incorrect/too-aggressive manual interventions needed to address other 
bugs) would be trivial to resolve if the master were looking at cluster state; 
but because of the split-brain problem, every part of the system is waiting for 
some other part with incorrect assumptions. So the whole thing is very fragile 
w.r.t. both bugs and manual interventions that, as we know, are often 
necessary despite best intentions (hence hbck/offline repair/etc.).
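
To make that concrete, here's a minimal Java sketch of one reconciliation pass. 
RegionEntry, Cluster, and the schedule* calls are hypothetical stand-ins, not 
the real HBase classes; the point is only the shape: every decision is derived 
from meta plus the server lists, with no procedure WAL.

{code:java}
import java.util.List;
import java.util.Set;

public class StatelessAssignmentSketch {

  enum State { OPEN, OPENING, CLOSING, CLOSED }

  // Simplified stand-in for a row of hbase:meta (hypothetical).
  record RegionEntry(String regionName, State state, String serverName) {}

  // Hypothetical view of the cluster; the master can already see all of this.
  interface Cluster {
    List<RegionEntry> scanMeta();                     // region states from meta
    Set<String> liveServers();                        // e.g. from ZK
    Set<String> serversWithWalDirs();                 // WAL directory listing
    void scheduleServerCrashRecovery(String server);  // idempotent slow action
    void scheduleAssign(RegionEntry region);          // idempotent slow action
  }

  // One pass of the "central fast actor": look at meta and the server lists,
  // fire off idempotent slow actions, keep no persistent state of its own.
  static void reconcile(Cluster c) {
    Set<String> live = c.liveServers();
    Set<String> withWals = c.serversWithWalDirs();
    for (RegionEntry r : c.scanMeta()) {
      boolean onDeadServer =
          r.serverName() != null && !live.contains(r.serverName());
      if (onDeadServer && withWals.contains(r.serverName())) {
        c.scheduleServerCrashRecovery(r.serverName()); // WALs to split first
      } else if (onDeadServer || r.state() == State.CLOSED) {
        c.scheduleAssign(r); // nothing left to recover; just assign it
      }
    }
  }
}
{code}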

For example, the above bug with an incorrect SCP for the meta server arose 
because master init waits for the SCP to fix meta, but the SCP doesn't know it 
needs to fix meta, because of some bug. Of course, if persistent SCPs didn't 
exist, the bug couldn't exist in the first place; but abstractly, if one actor 
were looking at this, it would just see meta assigned to a dead server and 
recover it, just like that. No state needed other than where meta is and the 
list of servers.
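
In the sketch above, that decision would be a single method; metaServer here is 
a hypothetical parameter standing in for the meta location (e.g. as read from 
ZK).

{code:java}
  // Continuing the sketch: recovering meta needs nothing persisted beyond
  // where meta is and which servers are alive.
  static void recoverMetaIfNeeded(Cluster c, String metaServer) {
    if (metaServer != null && !c.liveServers().contains(metaServer)) {
      // Meta is assigned to a dead server: split its WALs and reassign meta.
      c.scheduleServerCrashRecovery(metaServer);
    }
  }
{code}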

Then, to resolve this, we had to nuke the proc WAL to get rid of the bad SCP. 
Some more SCPs for other servers got lost in the nuke, and we ended up with 
regions CLOSING on dead servers that had neither an SCP nor a WAL directory. 
Again, looking from a unified perspective we can see: whoops, the region is 
CLOSING on a server that has no WALs to split - just count it as closed. 
Whereas now the close-region procedure is not responsible for this; it just 
waits for an SCP to deal with the server. But there's no SCP, because there's 
no WAL directory. So nobody looks at these two together... and after this 
manual intervention (or, for example, imagine there was an HDFS issue and the 
WAL write did not succeed) the cluster is broken and I have to go and fix 
those regions.
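
That rule - CLOSING on a dead server with no WALs left means CLOSED - is again 
one small decision in the same sketch, using the hypothetical stand-ins from 
above.

{code:java}
  // Continuing the sketch: resolve a CLOSING region from cluster state alone.
  static State resolveClosing(RegionEntry r, Cluster c) {
    if (r.state() != State.CLOSING || r.serverName() == null) {
      return r.state();
    }
    boolean dead = !c.liveServers().contains(r.serverName());
    boolean noWals = !c.serversWithWalDirs().contains(r.serverName());
    // Dead server, nothing to split, nothing to confirm, no SCP coming:
    // safe to count the region as closed and let assignment pick it up.
    return (dead && noWals) ? State.CLOSED : r.state();
  }
{code}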

Now I go to meta and set the regions to CLOSED (pretend I'm actually hbck2). If 
assignment were stateless, the master would see the closed regions and assign 
them. Whereas now the confirm-close retry loop is so well isolated that it 
doesn't care about anything else in the world and just blindly resets them back 
to CLOSING, so I additionally have to kill -9 the master to make sure those 
stupid RITs go away and the master actually recovers the regions on restart.

Luckily, when the recovered RIT procedures in this case see a CLOSED region 
with an empty server, they just silently go away (which might technically be a 
bug, but it works for me ;)). I've seen other cases where, when some procedure 
sees a region in an unexpected state (due to a race condition), it either fails 
the master (as with meta replicas) or updates the region to some other state, 
leaving things in a strange state.

This is just one example. And at all 3.5 steps, the persisted procedure state 
is 100% unnecessary, because the master has all the information to make correct 
decisions - as long as it's done in a sane way, e.g. with a hybrid actor model 
that keeps no persistent state of its own...



> stateless assignment
> --------------------
>
>                 Key: HBASE-21743
>                 URL: https://issues.apache.org/jira/browse/HBASE-21743
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> Running HBase for only a few weeks, we found a dozen (or more?) bugs with 
> assignment that all seem to have the same nature - a split brain between 2 
> procedures; or between a procedure and master startup (meta replica bugs); or 
> a procedure and master shutdown (HBASE-21742); or a procedure and something 
> else (when an SCP had an incorrect region list persisted; don't recall the 
> bug#). 
> To me, it starts to look like a pattern: just as in AMv1, where concurrent 
> interactions were unclear and hard to reason about, in AMv2 - despite the 
> cleaner individual pieces - the problem of unclear concurrent interactions 
> has been preserved, and in fact made worse by operation state persistence 
> and isolation.
> Procedures are great for multi-step operations that need rollback and the 
> like, e.g. creating a table or a snapshot, or even region splitting. 
> However, I'm not so sure about assignment. 
> We already have the persisted information - region state in meta (including 
> transition states like OPENING or CLOSING), and the server list as the WAL 
> directory list. Procedure state is not any more reliable than those (we can 
> argue that a meta update can fail, but so can a procv2 WAL flush, so we have 
> to handle out-of-date information regardless). So, we don't need any extra 
> state to decide on assignment, whether for recovery or for balancing. In 
> fact, as mentioned in some bugs, deleting the procv2 WAL is often the best 
> way to recover the cluster, because the master can already figure out what 
> to do without additional state.
> I think there should be an option for stateless assignment that does that.
> It could be either a separate pluggable assignment implementation, or an 
> option that does not recover SCPs, RITs, etc. from the WAL but always 
> derives recovery procedures from the existing cluster state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
