[ https://issues.apache.org/jira/browse/HBASE-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791007#comment-13791007 ]

Sergey Shelukhin commented on HBASE-5487:
-----------------------------------------

A big response to the recent comments that haven't been responded to yet.
Let me update the doc, probably EOW-ish, depending on the number of bugs 
surfacing ;)

[~stack]
Let's keep discussion and doc here and branch tasks out for rewrites.
bq. + The problem section is too short (state kept in multiple places and all 
have to agree...); need more full list so can be sure proposal addresses them 
all
What level of detail do you have in mind? It's not a bug fix, so I cannot 
really say "merge races with snapshot" or something like that; that could 
arguably also be resolved by another 100k patch to the existing AM :)
bq. + How is the proposal different from what we currently have? I see us tying 
regionstate to table state. That is new. But the rest, where we have a record 
and it is atomically changed looks like our RegionState in Master memory? There 
is an increasing 'version' which should help ensure a 'direction' for change 
which should help.
See the design principles (and the discussion below :)). We are trying to avoid 
multiple flavors of split-brain state.
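To make the "single record with a direction of change" idea concrete, here is a 
rough sketch (all names are made up for illustration, not a proposal for 
concrete classes):
{code:java}
// Sketch only: one versioned record per region, updated with compare-and-swap
// so there is a single direction of change; a stale writer fails the swap
// instead of creating a second, diverging copy of the truth.
final class RegionStateRecord {
  enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED, SPLITTING, MERGING }

  private final String regionName;
  private final State state;
  private final long version;   // monotonically increasing

  RegionStateRecord(String regionName, State state, long version) {
    this.regionName = regionName;
    this.state = state;
    this.version = version;
  }

  /** The only way to move forward: a new record with version + 1. */
  RegionStateRecord transitionTo(State next) {
    return new RegionStateRecord(regionName, next, version + 1);
  }

  String getRegionName() { return regionName; }
  State getState() { return state; }
  long getVersion() { return version; }
}

// The backing store only needs one primitive for this to work:
interface StateRecordStore {
  /** Atomically replace the record iff the stored version equals expectedVersion. */
  boolean compareAndSet(String regionName, long expectedVersion, RegionStateRecord next);
}
{code}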
bq. Its fine having a source of truth but ain't the hard part bring the system 
along? (meta edits, clients, etc.).
Yes :)
bq. Experience has zk as messy to reason with. It is also an indirection having 
RS and M go to zk to do 'state'.
I think ZK got a bad reputation not on its own merits, but because of how we 
use it.
I can see that problems exist, but IMHO the advantages outweigh the 
disadvantages compared to a system table.
As for a co-located system table, I am not so sure, but so far there isn't even 
a high-level design for it (for example: do all splits have to go through the 
master/system table now? How does it recover? etc.).
Perhaps we should sufficiently abstract an async persistence mechanism and then 
decide whether it would be ZK + notifications, a system table, memory + WAL, a 
co-located system table, or something else.
The problem is that how the master uses that interface would depend on its perf 
characteristics.
Anyway, we can work out the state transitions/concurrency/recovery without 
tying ourselves 100% to a particular store.
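Roughly the kind of abstraction I have in mind (purely illustrative; none of 
these interfaces exist in the code today):
{code:java}
// Sketch of a pluggable async persistence interface: the assignment code talks
// only to this, and ZK + watches, a system table, mem + WAL, or a co-located
// table is an implementation detail behind it.
public interface StateStore {

  interface PersistCallback {
    void onPersisted(String key, long newVersion);
    void onFailure(String key, Throwable cause);
  }

  interface ChangeListener {
    void onChange(String key, byte[] newState, long version);
  }

  /** Persist a state transition asynchronously; the callback fires once it is durable. */
  void persistAsync(String key, byte[] newState, long expectedVersion, PersistCallback cb);

  /** Change notifications: natural for ZK watches, polling for a table-backed store. */
  void subscribe(String keyPrefix, ChangeListener listener);
}
{code}
This is also what would make the state machine testable in isolation, with an 
in-memory implementation and no ZK or RS fleet needed.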

bq. + Agree that master should become a lib that any regionserver can run.
That sounds possible.

[~nkeywal]
bq. At least, we should make this really testable, without needing to set up a 
zk, a set of rs and so on.
+1, see my comment above. 
bq. I really really really think that we need to put performance as a 
requirement for any implementation. For example, something like: on a cluster 
with 5 racks of 20 regionservers each, with 200 regions per RS, the assignment 
will be completed in 1s if we lose one rack. I saw a reference to async ZK in 
the doc; it's great, because the performance is 10 times better.
We can measure and improve, but at this stage I am not really sure what the 
exact numbers will be (we don't even know what the storage is).


[~devaraj]
bq. A regionserver could first update the meta table, and then just notify the 
master that a certain transition was done; the master could initiate the next 
transition (Elliott Clark comment about coprocessor can probably be made to 
apply in this context). Only when a state change is recorded in meta, the 
operation is considered successful.
Split, for example, requires several changes to meta. Will the master be able 
to see them together from the hook? If the master is collocated in the same RS 
as meta, a master RPC should be a small overhead.

bq. Also, there is a chore (probably enhance catalog-janitor) in the master 
that periodically goes over the meta table and restarts (along with some 
diagnostics; probing regionservers in question etc.) failed/stuck state 
transitions. 
+1 on that. Transition states can include the start ts, so the master will know 
when they started.
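For illustration, the chore could look roughly like this (all hypothetical 
names, just to show how the start ts and a timeout would be used):
{code:java}
// Sketch only: a periodic task (e.g. an extended catalog janitor) that scans
// the persisted in-flight transitions and re-drives any transition that has
// been stuck longer than a timeout, after probing the regionserver involved.
import java.util.List;

class StuckTransitionJanitor implements Runnable {
  interface TransitionStore {
    List<Transition> listInFlight();      // e.g. regions still in OPENING
  }
  interface Transition {
    long getStartTimestamp();             // persisted when the transition began
  }
  interface RetryHandler {
    void retry(Transition t);             // probe the RS, then restart the transition
  }

  private final TransitionStore store;
  private final RetryHandler handler;
  private final long timeoutMs;

  StuckTransitionJanitor(TransitionStore store, RetryHandler handler, long timeoutMs) {
    this.store = store;
    this.handler = handler;
    this.timeoutMs = timeoutMs;
  }

  @Override
  public void run() {
    long now = System.currentTimeMillis();
    for (Transition t : store.listInFlight()) {
      if (now - t.getStartTimestamp() > timeoutMs) {
        handler.retry(t);
      }
    }
  }
}
{code}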

bq. I think we should also save the operations that were initiated by the client 
on the master (either in WAL or in some system table) so that the master 
doesn't lose track of those and can execute them in the face of crashes & 
restarts. For example, if the user had sent a 'split region' operation and the 
master crashed
Yeah, "disable table" or "move region" are a good example. Probably we'd need 
ZK/system table/WAL for ongoing logical operations.
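For example, recording an ongoing logical operation could look something like 
this, regardless of whether the backend ends up being ZK, a system table, or a 
WAL (illustrative names only):
{code:java}
// Sketch: the master records the client-initiated operation durably before
// acting on it, and replays unfinished operations after a restart.
enum OpState { PENDING, IN_PROGRESS, DONE }

class LogicalOperation {
  final String opId;        // e.g. "disable-table:usertable"
  final byte[] payload;     // serialized request
  OpState state = OpState.PENDING;

  LogicalOperation(String opId, byte[] payload) {
    this.opId = opId;
    this.payload = payload;
  }
}

interface OperationLog {
  void record(LogicalOperation op);           // durable before execution starts
  void update(String opId, OpState newState);
  Iterable<LogicalOperation> unfinished();    // replayed on master startup
}
{code}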

[~jxiang]
bq. We should not have another janitor/chore. If an action is failed, it must 
be because of something unrecoverable by itself, not because of a bug in our 
code. It should stay failed until the issue is resolved.
I think the failures meant here are things like an RS going away, or being slow 
or buggy, so OPENING got stuck; someone needs to pick it up after a timeout.

bq. We need to have something like FATE in accumulo to queue/retry actions 
taking several steps like split/merge/move.
We basically need something that allows atomic state changes. HBase or ZK or 
mem+WAL fits the bill :)
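Roughly what the FATE-like piece could look like, with the step index advanced 
atomically in whatever store we pick (again, all made-up names):
{code:java}
// Sketch: a multi-step operation (split/merge/move) is a list of idempotent
// steps; progress is persisted and advanced with an atomic, versioned update,
// so a restarted or competing runner resumes from the recorded position.
import java.util.List;

interface Step {
  void execute() throws Exception;    // must be safe to re-run after a crash
}

interface ProgressStore {
  int currentStep(String opId);
  /** Atomically advance from expectedStep to expectedStep + 1; false if someone else did. */
  boolean advance(String opId, int expectedStep);
}

class MultiStepRunner {
  private final ProgressStore progress;

  MultiStepRunner(ProgressStore progress) { this.progress = progress; }

  void run(String opId, List<Step> steps) throws Exception {
    int i = progress.currentStep(opId);   // resume from the persisted position
    while (i < steps.size()) {
      steps.get(i).execute();
      if (!progress.advance(opId, i)) {
        return;                           // lost the race; another runner took over
      }
      i++;
    }
  }
}
{code}
The important property is that each step is idempotent and the advance is 
atomic, so the store (HBase table, ZK, or mem+WAL) is interchangeable.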



> Generic framework for Master-coordinated tasks
> ----------------------------------------------
>
>                 Key: HBASE-5487
>                 URL: https://issues.apache.org/jira/browse/HBASE-5487
>             Project: HBase
>          Issue Type: New Feature
>          Components: master, regionserver, Zookeeper
>    Affects Versions: 0.94.0
>            Reporter: Mubarak Seyed
>            Priority: Critical
>         Attachments: Region management in Master.pdf
>
>
> Need a framework to execute master-coordinated tasks in a fault-tolerant 
> manner. 
> Master-coordinated tasks such as online schema change and delete-range 
> (deleting region(s) based on start/end key) can make use of this framework.
> The advantages of the framework are:
> 1. Eliminate repeated code in Master, ZooKeeper tracker, and Region-server for 
> master-coordinated tasks
> 2. Ability to abstract the common functions across Master -> ZK and RS -> ZK
> 3. Easy to plug in new master-coordinated tasks without adding code to core 
> components


