[
https://issues.apache.org/jira/browse/HBASE-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791007#comment-13791007
]
Sergey Shelukhin commented on HBASE-5487:
-----------------------------------------
A big response to the recent comments I haven't replied to yet.
Let me update the doc, probably EOW-ish, depending on the number of bugs
surfacing ;)
[~stack]
Let's keep discussion and doc here and branch tasks out for rewrites.
bq. + The problem section is too short (state kept in multiple places and all
have to agree...); need more full list so can be sure proposal addresses them
all
What level of detail do you have in mind? It's not a bug fix, so I cannot
really say "merge races with snapshot", or something like that; that could
arguably also be resolved by another 100k patch to the existing AM :)
bq. + How is the proposal different from what we currently have? I see us tying
regionstate to table state. That is new. But the rest, where we have a record
and it is atomically changed looks like our RegionState in Master memory? There
is an increasing 'version' which should help ensure a 'direction' for change
which should help.
See the design principles (and the discussion below :)). We are trying to avoid
multiple flavors of split-brain state.
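To make the "increasing version" idea concrete, here is a rough sketch (made-up
class and names, nothing like this exists in the codebase) of a record whose
version only moves forward, so a stale actor cannot push the state backwards:
{code:java}
// Minimal sketch, hypothetical class: a state record whose version only ever
// increases, so a stale actor cannot move the state "backwards".
import java.util.concurrent.atomic.AtomicReference;

public class RegionStateRecord {
  public enum State { OFFLINE, OPENING, OPEN, CLOSING, SPLITTING }

  private static final class Versioned {
    final long version;
    final State state;
    Versioned(long version, State state) { this.version = version; this.state = state; }
  }

  private final AtomicReference<Versioned> current =
      new AtomicReference<Versioned>(new Versioned(0, State.OFFLINE));

  /**
   * Apply a transition only if the caller saw the latest version; returns the
   * new version on success, or -1 if someone else got there first.
   */
  public long transition(long expectedVersion, State newState) {
    Versioned cur = current.get();
    if (cur.version != expectedVersion) {
      return -1; // stale caller; reject instead of silently overwriting
    }
    Versioned next = new Versioned(cur.version + 1, newState);
    return current.compareAndSet(cur, next) ? next.version : -1;
  }

  public long version() { return current.get().version; }
  public State state() { return current.get().state; }
}
{code}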
bq. Its fine having a source of truth but ain't the hard part bring the system
along? (meta edits, clients, etc.).
Yes :)
bq. Experience has zk as messy to reason with. It is also an indirection having
RS and M go to zk to do 'state'.
I think ZK got a bad reputation not on its own merit, but because of how we use it.
I can see that problems exist, but IMHO the advantages outweigh the disadvantages
compared to a system table.
As for a co-located system table, I am not so sure, but so far there isn't even a
high-level design for it (for example, do all splits have to go through the
master/system table now? How does it recover? etc.).
Perhaps we should abstract the async persistence mechanism sufficiently and then
decide whether it would be ZK + notifications, a system table, memory + WAL, a
co-located system table, or something else.
The problem is that how the master uses that interface would depend on the perf
characteristics of the store.
Anyway, we can work out the state transitions/concurrency/recovery without tying
them 100% to a particular store.
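Roughly what I mean by abstracting the persistence mechanism, as a sketch only
(the interface and all names here are made up, not existing HBase code):
{code:java}
// Hypothetical abstraction: master/RS code programs against this, and the
// backing store (ZK + notifications, system table, mem + WAL, co-located
// system table, ...) becomes an implementation detail behind it.
import java.util.concurrent.CompletableFuture;

public interface RegionStateStore {
  /** Atomically persist a state transition; completes when the change is durable. */
  CompletableFuture<Long> persistTransition(String regionName,
                                            long expectedVersion,
                                            byte[] newState);

  /** Read the latest durable state and its version. */
  CompletableFuture<VersionedState> read(String regionName);

  /** Subscribe to changes (ZK watches, table scanning, in-memory callbacks, ...). */
  void addListener(String regionName, StateListener listener);

  interface StateListener {
    void onStateChanged(String regionName, long version, byte[] newState);
  }

  class VersionedState {
    public final long version;
    public final byte[] state;
    public VersionedState(long version, byte[] state) {
      this.version = version;
      this.state = state;
    }
  }
}
{code}
The master would only ever talk to something like this; which store backs it
becomes swappable, modulo the perf caveat above.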
bq. + Agree that master should become a lib that any regionserver can run.
That sounds possible.
[~nkeywal]
bq. At least, we should make this really testable, without needing to set up a
zk, a set of rs and so on.
+1, see my comment above.
bq. I really really really think that we need to put performance as a
requirement for any implementation. For example, something like: on a cluster
with 5 racks of 20 regionservers each, with 200 regions per RS, the assignment
will be completed in 1s if we lose one rack. I saw a reference to async ZK in
the doc, which is great, because the performance is 10 times better.
We can measure and improve, but at this stage I am not really sure what the
exact numbers will be (we don't even know what the storage is yet).
[~devaraj]
bq. A regionserver could first update the meta table, and then just notify the
master that a certain transition was done; the master could initiate the next
transition (Elliott Clark comment about coprocessor can probably be made to
apply in this context). Only when a state change is recorded in meta, the
operation is considered successful.
Split, for example, requires several changes to meta. Will the master be able to
see them together from the hook? If the master is collocated in the same RS as
meta, a master RPC should be small overhead.
bq. Also, there is a chore (probably enhance catalog-janitor) in the master
that periodically goes over the meta table and restarts (along with some
diagnostics; probing regionservers in question etc.) failed/stuck state
transitions.
+1 on that. Transition states can record the start ts, so the master will know
when they started.
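Pure sketch of what that chore could look like (hypothetical types; the real
thing would presumably hang off or extend the catalog janitor):
{code:java}
// Hypothetical chore: scans in-flight transition records, compares their
// recorded start ts against a timeout, and re-queues the stuck ones.
import java.util.List;

public class StuckTransitionChore implements Runnable {

  public static final class Transition {
    final String regionName;
    final long startTs; // recorded when the transition began
    public Transition(String regionName, long startTs) {
      this.regionName = regionName;
      this.startTs = startTs;
    }
  }

  public interface TransitionStore {
    List<Transition> inFlightTransitions();
    void requeue(Transition t); // probe the RS / restart the transition
  }

  private final TransitionStore store;
  private final long timeoutMs;

  public StuckTransitionChore(TransitionStore store, long timeoutMs) {
    this.store = store;
    this.timeoutMs = timeoutMs;
  }

  @Override
  public void run() {
    long now = System.currentTimeMillis();
    for (Transition t : store.inFlightTransitions()) {
      if (now - t.startTs > timeoutMs) {
        store.requeue(t); // e.g. OPENING stuck because the RS went away
      }
    }
  }
}
{code}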
bq. I think we should also save the operations that were initiated by the client
on the master (either in WAL or in some system table) so that the master
doesn't lose track of those and can execute them in the face of crashes &
restarts. For example, if the user had sent a 'split region' operation and the
master crashed
Yeah, "disable table" or "move region" are a good example. Probably we'd need
ZK/system table/WAL for ongoing logical operations.
[~jxiang]
bq. We should not have another janitor/chore. If an action is failed, it must
be because of something unrecoverable by itself, not because of a bug in our
code. It should stay failed until the issue is resolved.
I think the failures meant here are things like the RS going away, or being slow
or buggy, so OPENING got stuck; someone needs to pick it up after a timeout.
bq. We need to have something like FATE in accumulo to queue/retry actions
taking several steps like split/merge/move.
We basically need something that allows atomic state changes. HBase, ZK, or
mem+WAL would fit the bill :)
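A FATE-ish thing doesn't need a big surface area, either. Rough sketch (made-up
API, not Accumulo's FATE and not existing HBase code) of a multi-step operation
that persists its progress after each step, so a restarted master can resume it:
{code:java}
// Hypothetical sketch: a multi-step op (split/merge/move) that records each
// completed step durably, so a restarted master resumes instead of replaying.
import java.util.List;

public class MultiStepOperation {

  public interface Step {
    void execute() throws Exception; // must be idempotent so retries are safe
  }

  public interface ProgressStore {
    int lastCompletedStep(String opId);             // -1 if nothing done yet
    void markCompleted(String opId, int stepIndex); // atomic, durable write
  }

  private final String opId;
  private final List<Step> steps;
  private final ProgressStore progress;

  public MultiStepOperation(String opId, List<Step> steps, ProgressStore progress) {
    this.opId = opId;
    this.steps = steps;
    this.progress = progress;
  }

  /** Run (or resume) the operation; safe to call again after a crash. */
  public void runOrResume() throws Exception {
    for (int i = progress.lastCompletedStep(opId) + 1; i < steps.size(); i++) {
      steps.get(i).execute();
      progress.markCompleted(opId, i);
    }
  }
}
{code}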
> Generic framework for Master-coordinated tasks
> ----------------------------------------------
>
> Key: HBASE-5487
> URL: https://issues.apache.org/jira/browse/HBASE-5487
> Project: HBase
> Issue Type: New Feature
> Components: master, regionserver, Zookeeper
> Affects Versions: 0.94.0
> Reporter: Mubarak Seyed
> Priority: Critical
> Attachments: Region management in Master.pdf
>
>
> Need a framework to execute master-coordinated tasks in a fault-tolerant
> manner.
> Master-coordinated tasks such as online schema change and delete-range
> (deleting region(s) based on start/end key) can make use of this framework.
> The advantages of the framework are
> 1. Eliminate repeated code in Master, ZooKeeper tracker and Region-server for
> master-coordinated tasks
> 2. Ability to abstract the common functions across Master -> ZK and RS -> ZK
> 3. Easy to plug in new master-coordinated tasks without adding code to core
> components
--
This message was sent by Atlassian JIRA
(v6.1#6144)