[
https://issues.apache.org/jira/browse/HBASE-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616129#comment-13616129
]
Jonathan Hsieh commented on HBASE-5487:
---------------------------------------
To do a major overhaul, we need something stronger than "the code is hard to
read". I agree that it is hard to follow (see:
http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf) but it
seems to be basically working which is a pretty strong argument. Let's compare
and point out what is wrong/broken in the current implementation and how the
new design won't have those problems.
The spreadsheet link is my first step to enumerating semantics and distilling
the set of possible problems and things that are being guarded from races. Any
major-overhaul solution should make sure that these operations, when issued
concurrently, interact according to a sane set of semantics in the face of
failures.
bq. Only for the current document version... tables could be added
So I buy open/close as a region operation. Split/merge are multi-region
operations -- is there enough state to recover from a failure?
So alter table is a region operation? Why isn't it in the state machine?
bq. Hmm... that would require implementing region locks, and having a very
large cluster. I am talking more about unacceptable blocking of user
operations, and management of expiring locks in the presence of real-life failures.
Implementing region locks is too far -- I'm asking for some back-of-the-napkin
discussion. I think we need some measurements of how much throughput we can get
in ZK or with a ZK-lock implementation, and compare that with # of RS watchers *
# of regions * # of ops.
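To make that back-of-the-napkin comparison concrete, here is a rough sketch of the arithmetic. All cluster sizes below are hypothetical example numbers, not measurements:

```java
// Back-of-the-napkin estimate of ZK load for a lock-per-region scheme.
// All input numbers are hypothetical examples, not measurements.
public class ZkLockNapkin {
    // Worst case: every region server holds a watch on every region's lock znode.
    static long totalWatches(long regionServers, long regions) {
        return regionServers * regions;
    }

    // ZK writes per second if each region op needs a lock acquire + release.
    static long zkWritesPerSec(long opsPerSec, int writesPerOp) {
        return opsPerSec * writesPerOp;
    }

    public static void main(String[] args) {
        long rs = 1000, regions = 100_000, opsPerSec = 500;
        System.out.println("watches: " + totalWatches(rs, regions));
        System.out.println("writes/s: " + zkWritesPerSec(opsPerSec, 2));
    }
}
```

Even with made-up inputs, the watch count grows as RS * regions, which is the product that would need to be measured against real ZK capacity.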
The current regions-in-transition (RIT) code basically assumes that an absent
znode means the region is either closed or opened. RIT znodes are present only
when the region is in the in-between states (opening, closing, etc.).
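That absent-znode convention can be sketched as a simple predicate over region states (the enum here is illustrative, not the actual HBase class):

```java
// Illustrative model of the RIT convention: a znode exists only for the
// in-between states; absence means the region is stably OPEN or CLOSED.
public class RitModel {
    enum RegionState { OPEN, CLOSED, OPENING, CLOSING }

    static boolean hasRitZnode(RegionState s) {
        return s == RegionState.OPENING || s == RegionState.CLOSING;
    }

    public static void main(String[] args) {
        for (RegionState s : RegionState.values()) {
            System.out.println(s + " -> znode present: " + hasRitZnode(s));
        }
    }
}
```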
bq. You mean like WAL for operations?
Yeah, we could call it an "intent" log. It would have info so that a promoted
backup master can look in one place and complete an operation started by the
downed original master.
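A minimal sketch of what such an "intent" record might carry (all field and type names here are hypothetical, not an actual HBase design):

```java
import java.util.List;

// Hypothetical "intent" log record: enough state in one place for a
// promoted backup master to complete an operation the downed master started.
public class IntentLogSketch {
    enum Op { SPLIT, MERGE, ALTER_TABLE }
    enum Phase { INTENDED, IN_PROGRESS, DONE }

    record Intent(long id, Op op, List<String> regions, Phase phase) {
        Intent advance(Phase next) { return new Intent(id, op, regions, next); }
    }

    // Recovery rule: any intent not DONE must be rolled forward (or back).
    static boolean needsRecovery(Intent i) { return i.phase() != Phase.DONE; }

    public static void main(String[] args) {
        Intent split = new Intent(1, Op.SPLIT,
                List.of("parent", "daughterA", "daughterB"), Phase.INTENDED);
        System.out.println("recover? " + needsRecovery(split));
        System.out.println("recover? " + needsRecovery(split.advance(Phase.DONE)));
    }
}
```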
bq. ... Also usually that would mean RSes won't be able to initiate operations
(like split) - they will have to go thru master (which I would argue is ok).
I know I've suggested something like this before. Currently the RS initiates a
split and does the region open/meta changes; if there are errors, at some
point the master side detects a timeout. An alternative would have splits
still initiated on the RS, but have the master do some kind of atomic changes
to meta and region state for the 3 involved regions (parent, daughter A, and
daughter B).
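One way to picture those atomic changes for the 3 involved regions is an all-or-nothing commit: staged first, then published in one step. This is only a sketch of the shape of the idea -- what transactional mechanism actually backs the commit (e.g. a ZK multi op) is exactly the open question:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: the master commits the split's three region-state changes as one
// all-or-nothing step, so a master crash never leaves meta half-updated.
public class AtomicSplitSketch {
    final Map<String, String> regionState = new HashMap<>();

    // Stage all three changes, then publish together. The publish step here
    // is a stand-in for a real transactional commit (e.g. a ZK multi op).
    void commitSplit(String parent, String daughterA, String daughterB) {
        Map<String, String> staged = new HashMap<>(regionState);
        staged.put(parent, "OFFLINE_SPLIT");
        staged.put(daughterA, "OPENING");
        staged.put(daughterB, "OPENING");
        regionState.clear();
        regionState.putAll(staged);
    }

    public static void main(String[] args) {
        AtomicSplitSketch master = new AtomicSplitSketch();
        master.commitSplit("parent", "daughterA", "daughterB");
        System.out.println(master.regionState);
    }
}
```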
bq. Depends on where we store it, but yeah these have to be transactional. Last
section (very short ) suggests using ZK, which already supports that.
We need to be careful about ZK -- since it is also a network connection,
exceptions could mean failures or timeouts (where the operation succeeded but
the ack was lost). We should describe the properties (durable vs. erasable)
and the assumptions: if a wipeable ZK is the source of truth, how do we make
sure the version state is recoverable without time travel?
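The lost-ack ambiguity can be modeled without a real ZK cluster. In this sketch (a stand-in store, not the ZooKeeper client API), a timed-out write may in fact have been applied, so the safe pattern is to read back and verify rather than treat the timeout as a definite failure:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Model of the ambiguity above: a timed-out write may have been applied
// server-side, so recovery must read back state, not assume failure.
public class AmbiguousAckSketch {
    static class AckTimeout extends Exception {}

    // Stand-in for ZK: the write lands, but the ack is "lost" once.
    static class FlakyStore {
        final Map<String, String> data = new HashMap<>();
        boolean dropNextAck = true;
        void create(String path, String value) throws AckTimeout {
            data.put(path, value);           // the write itself succeeds
            if (dropNextAck) { dropNextAck = false; throw new AckTimeout(); }
        }
        Optional<String> read(String path) {
            return Optional.ofNullable(data.get(path));
        }
    }

    // On timeout, check whether our write actually landed before deciding
    // the operation failed.
    static boolean ensureCreated(FlakyStore zk, String path, String value) {
        try {
            zk.create(path, value);
            return true;
        } catch (AckTimeout e) {
            return zk.read(path).map(value::equals).orElse(false);
        }
    }

    public static void main(String[] args) {
        FlakyStore zk = new FlakyStore();
        System.out.println(ensureCreated(zk, "/rit/region1", "OPENING"));
    }
}
```

The same read-back discipline is what makes the "durable vs. erasable" distinction matter: verification only works if the state being read back is itself trustworthy.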
> Generic framework for Master-coordinated tasks
> ----------------------------------------------
>
> Key: HBASE-5487
> URL: https://issues.apache.org/jira/browse/HBASE-5487
> Project: HBase
> Issue Type: New Feature
> Components: master, regionserver, Zookeeper
> Affects Versions: 0.94.0
> Reporter: Mubarak Seyed
> Attachments: Region management in Master.pdf
>
>
> Need a framework to execute master-coordinated tasks in a fault-tolerant
> manner.
> Master-coordinated tasks such as online schema change and delete-range
> (deleting region(s) based on start/end key) can make use of this framework.
> The advantages of framework are
> 1. Eliminate repeated code in Master, ZooKeeper tracker and Region-server for
> master-coordinated tasks
> 2. Ability to abstract the common functions across Master -> ZK and RS -> ZK
> 3. Easy to plugin new master-coordinated tasks without adding code to core
> components
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira