[
https://issues.apache.org/jira/browse/HBASE-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791516#comment-13791516
]
Feng Honghua commented on HBASE-5487:
-------------------------------------
bq.Master is the Actor. Having it go across a network to get/set the 'state' in
a service that is non-transactional wasn't our smartest move.
Regionservers currently report state via ZK. Master reads it from ZK. Would be
better if the RS just reported directly to the Master.
[~stack] Yes, this is exactly what I proposed in HBASE-9726 :-)
bq.I am wondering whether it makes sense to update the meta table from the
various regionservers on the region state changes or go via the master.. But
maybe the master doesn't need to be a bottleneck if possible. A regionserver
could first update the meta table, and then just notify the master that a
certain transition was done; the master could initiate the next transition
[~devaraj] It would be better to let the master update the meta table rather
than let various regionservers do it. With the master as the single actor and
truth-maintainer, many tricky bugs/problems can be avoided. And for frequent
state changes, the regionserver serving the (state) meta table would become
the bottleneck sooner than the master issuing the update requests, so it
doesn't matter whether the update requests come from the master or from
various regionservers.
bq.I prefer not to use ZK since it's kind of the root cause of uncertainty: has
the master/region server got/processed the event? has the znode been hijacked
since the master/region server changed its mind?
We should store the state in the meta table, which is cached in memory.
Whether to use a coprocessor is not a big concern to me. If we don't use a
coprocessor, I prefer to use the master as the proxy to do all meta table
updates. Otherwise, we need to listen to something for updates.
[~jxiang] Agree. IMO ZK alone is not the root cause of uncertainty; the current
usage pattern of ZK is. The pattern where a regionserver updates state in ZK
and the master listens to ZK and updates state in its local memory accordingly
exhibits too many tricky scenarios/bugs, because a ZK watch is one-time (which
can result in missed state transitions) and the notification/processing is
asynchronous (which can lead to delayed/out-of-date state in master memory).
And by replacing ZK with the meta table, we also need to discard this 'RS
updates, master listens' pattern, since the meta table inherently lacks a
listen-notify mechanism :-).
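The missed-transition hazard of one-time watches can be shown with a toy
simulation (plain Java, NOT the real ZooKeeper client API; `ToyZnode` and
`observedStates` are made-up names for illustration): a watcher fires once and
is consumed, so the second of two back-to-back updates goes unnoticed unless
the watch is re-registered in time.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of ZooKeeper's one-time watch semantics (NOT the real client
// API; ToyZnode/observedStates are made-up names for illustration).
class OneShotWatchDemo {
    interface Watcher { void process(String newState); }

    static class ToyZnode {
        private Watcher watcher;  // one-shot: cleared after it fires once

        void setWatcher(Watcher w) { this.watcher = w; }

        void setData(String newData) {
            if (watcher != null) {
                Watcher w = watcher;
                watcher = null;          // ZK-style: the watch is consumed
                w.process(newData);
            }
        }
    }

    // The states the master actually observes for two rapid updates when it
    // does not manage to re-register the watch in between.
    static List<String> observedStates() {
        ToyZnode znode = new ToyZnode();
        List<String> seen = new ArrayList<>();
        znode.setWatcher(seen::add);   // master registers a watch once
        znode.setData("OPENING");      // fires (and consumes) the watch
        znode.setData("OPENED");       // no watch registered -> transition missed
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("znode transitions: [OPENING, OPENED]");
        System.out.println("master observed:   " + observedStates());
    }
}
```

The real client must re-register the watch inside the callback, and any change
that lands before re-registration is silently skipped, which is exactly why
the master's in-memory state can drift.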
bq.I think ZK got a bad reputation not on its own merit, but on how we use it.
I can see that problems exist but IMHO advantages outweigh the disadvantages
compared to system table.
Co-located system table, I am not so sure, but so far there's not even a
high-level design for this (for example - do all splits have to go thru
master/system table now? how does it recover? etc.).
Perhaps we should abstract an async persistence mechanism sufficiently and then
decide. Whether it would be ZK+notifications, or system table, or memory + wal,
or colocated system table, or what.
The problem is that the usage inside master of that interface would depend on
perf characteristics.
Anyway, we can work out the state transitions/concurrency/recovery without
tying 100% to particular store.
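The "abstract the persistence mechanism first, then decide" idea could look
roughly like the sketch below (hypothetical names, not an HBase API): the
master codes against a small async store interface, and ZK+notifications, a
system table, or memory+WAL would each be an implementation behind it.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction for the assignment-state store; the backing
// mechanism (ZK + notifications, system table, memory + WAL, ...) is
// swapped behind it without touching master logic.
interface RegionStateStore {
    CompletableFuture<Void> setState(String region, String state); // async persist
    String getState(String region);                                // cached read
}

// Trivial in-memory implementation, standing in for any of the candidates.
class InMemoryStateStore implements RegionStateStore {
    private final Map<String, String> states = new ConcurrentHashMap<>();

    public CompletableFuture<Void> setState(String region, String state) {
        states.put(region, state);
        return CompletableFuture.completedFuture(null);  // already "durable" here
    }

    public String getState(String region) {
        return states.get(region);
    }
}
```

As the quoted comment notes, the open question is that how the master uses
this interface (batching, blocking vs. async) would still depend on the perf
characteristics of the chosen backend.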
[~sershe] Agree on "ZK got a bad reputation not on its own merit, but on how we
use it", especially if you mean that the master currently relies on ZK
watch/notification to maintain/update its in-memory region state. IMO this is
the biggest root cause of problems in the current assignment design. If we used
ZK the same way as we would use the meta table for storing states, it would
make no big difference whether the states are stored in ZK or the meta table
(except that the meta table can perform much better when restarting a big
cluster with a large number of regions), right? But using ZK's update/listen
pattern does make the difference.
bq.btw, any input on actor model?
Things queue up operations/notifications ("ops") for master; "AM" runs on a
timer or when the queue is non-empty, having as inputs the cluster state (incl.
ongoing internal actions it ordered before, e.g. OPENING state for a region)
plus new ops from the queue, on a single thread; it generates new actions (not
physically doing anything, e.g. talking to RS); the ops state and cluster state
are persisted; then actions are executed on different threads (e.g. messages
sent to RS-es, etc.), and "AM" runs again, or sleeps for some time if the ops
queue is empty.
That is a different model, not sure if it scales for large clusters.
[~sershe] "operations/notifications" means the RS reports action progress to
the master? The master is the single point that updates the state "truth" (to
the meta table), and the RS doesn't know where the states are stored and
doesn't access them directly, right? I think a communication/storage diagram
would help a lot for an overall clear understanding here :-)
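For reference, the single-threaded "AM" loop described in the quoted comment
can be sketched as follows (all names here - `ActorAmSketch`, `runOnce`, etc. -
are hypothetical, not HBase code): ops queue up, one thread drains them and
updates the cluster state, and the resulting actions are handed back for
execution on other threads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Minimal sketch of the single-threaded "AM" loop described above.
// Hypothetical names, not HBase code: ops queue up; one thread drains the
// queue, mutates cluster state, and emits actions; a real system would then
// persist the state and run the actions (RPCs to RSes) on other threads.
class ActorAmSketch {
    private final Queue<String[]> opQueue = new ArrayDeque<>();     // {region, event}
    private final Map<String, String> clusterState = new HashMap<>();

    void enqueueOp(String region, String event) {
        opQueue.add(new String[] { region, event });
    }

    // One AM iteration on a single thread: inputs are the cluster state plus
    // new ops; output is the list of actions to execute elsewhere.
    List<String> runOnce() {
        List<String> actions = new ArrayList<>();
        String[] op;
        while ((op = opQueue.poll()) != null) {
            String region = op[0], event = op[1];
            clusterState.put(region, event);
            if ("OFFLINE".equals(event)) {
                // AM orders the next transition but does not perform it here.
                clusterState.put(region, "OPENING");
                actions.add("sendOpenRpc(" + region + ")");
            }
        }
        // A real implementation would persist clusterState + pending ops here.
        return actions;
    }

    String stateOf(String region) {
        return clusterState.get(region);
    }
}
```

Because only the `runOnce` thread ever touches the state, the actor model
avoids the concurrent-update races of the watch/listen pattern; the scaling
question raised above is whether one such thread keeps up on large clusters.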
> Generic framework for Master-coordinated tasks
> ----------------------------------------------
>
> Key: HBASE-5487
> URL: https://issues.apache.org/jira/browse/HBASE-5487
> Project: HBase
> Issue Type: New Feature
> Components: master, regionserver, Zookeeper
> Affects Versions: 0.94.0
> Reporter: Mubarak Seyed
> Priority: Critical
> Attachments: Region management in Master.pdf
>
>
> Need a framework to execute master-coordinated tasks in a fault-tolerant
> manner.
> Master-coordinated tasks such as online-scheme change and delete-range
> (deleting region(s) based on start/end key) can make use of this framework.
> The advantages of framework are
> 1. Eliminate repeated code in Master, ZooKeeper tracker and Region-server for
> master-coordinated tasks
> 2. Ability to abstract the common functions across Master -> ZK and RS -> ZK
> 3. Easy to plugin new master-coordinated tasks without adding code to core
> components
--
This message was sent by Atlassian JIRA
(v6.1#6144)