[
https://issues.apache.org/jira/browse/HBASE-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799619#comment-13799619
]
Sergey Shelukhin edited comment on HBASE-5487 at 10/18/13 10:50 PM:
--------------------------------------------------------------------
Answers lifted from email also (some fixes + one answer was modified due to
clarification here :)).
bq. What is a failure and how do you react to failures? I think the master5
design needs to spend more effort to considering failure and recovery cases. I
claim there are 4 types of responses from a networked IO operation - two
states we normally deal with ack successful, ack failed (nack) and unknown due
to timeout that succeeded (timeout success) and unknown due to timeout that
failed (timeout failed). We have historically missed the last two cases and
they aren't considered in the master5 design.
There are a few considerations. Let me examine if there are other cases than
these.
I am assuming the collocated table, which should reduce such cases for state
(probably, if collocated table cannot be written reliably, master must
stop-the-world and fail over).
When RS contacts master to do state update, it errs on the side of caution - no
state update, no open region (or split).
Thus, except for the case of multiple masters running, we can always assume RS
didn't online the region if we don't know about it.
Then, for messages to RS, see "Note on messages"; they are idempotent so they
can always be resent.
bq. 1) State update coordination. What is a "state updates from the outside"
Do RS's initiate splitting on their own? Maybe a picture would help so we can
figure out if it is similar or different from hbck-master's?
Yes, these are RS messages. They are mentioned in some operation descriptions
in part 2 - opening->opened, closing->closed; splitting, etc.
bq. 2) Single point of truth. hbck-master tries to define what single point
of truth means by defining intended, current, and actual state data with
durability properties on each kind. What do clients look at who modifies what?
Sorry, don't understand the question. I mean single source of truth mainly
about what is going on with the region; it is described in design
considerations.
I like the idea of "intended state", however without more detailed reading I am
not sure how it works for multiple ops e.g. master recovering the region while
the user intends to split it, so the split should be executed after it's opened.
bq. 3) Table record: "if regions is out of date, it should be closed and
reopened". It is not clear in master5 how regionservers find out that they are
out of date. Moreover, how do clients talking to those RS's with stale versions
know they are going to the correct RS especially in the face of RS failures due
to timeout?
On alter (and startup if failed), master tries to reopen all regions that are
out of date.
Regions that are not opened with either pick up the new version when they are
opened, or (e.g. if they are now Opening with old version) master discovers
they are out of date when they are transitioned to Opened by RS, and reopens
them again.
As for any case of alter on enabled table, there are no guarantees for clients.
To provide these w/o disable/enable (or logical equivalent of coordinating all
close-s and open-s), one would need some form of version-time-travel, or
waiting for versions, or both.
bq. 4) region record: transition states. This is really similar to
hbck-masters current state and intended state. Shouldn't be defined as part of
the region record?
I mention somewhere that could be done. One thing is that if several paths are
possible between states, it's useful to know which is taken.
But do note that I store user intent separately from what is currently going
on, so they are not exactly similar as far as I see.
bq. 5) Note on user operations: the forgetting thing is scary to me -- in your
move split example, what happens if an RS reads state that is forgotten?
I think my description of this might be too vague. State is not forgotten;
previous intent is forgotten. I.e. if user does several operations in order
that conflict (e.g. split and then merge), the first one will be canceled
(safely :)).
Also, RS does not read state as a guideline to what needs to be done.
bq. 6) table state machine. how do we guarantee clients are writing from the
correct version in the in failures?
Can you please elaborate?
bq. 7) region state machine. Earlier draft hand splitting and merge cases.
Are they elided in master5 or are not present any more. How would this get
extended handle jeffrey's distributed log replay/fast write recovery feature?
As I mention somewhere these could be separate states. I was kind of afraid of
blowing up state machine too much, so I noticed that for split/merge you anyway
store siblings/children, so you can recognize them and for most purposes
different split-merge states are the same as Opened and Closed.
I will add those back, it would make sense.
bq. 8) logical interactions: sounds like master5 allows concurrent region and
table operations. hbck-master (though not fully documented) only allows
certain region transitions when the table is enabled or if the table is
disabled. Are we sure we don't get into race conditions? What happens if
disable gets issued -- its possible for someone to reopens the region and for
old clients to continue writing to it even though it is closed?
Yes, parallelism is intended. You can never be sure you have no races but we
should aim for it :)
master5 is missing disabled/enabled check, that is a mistake.
Part1 operation interactions already cover it:
table disable doesn't ack until all regions are closed (master5 is wrong
:().
region opening cannot start if table is already disabling or disabled.
if region is already opening when disable is issued, opening will be
opportunistically canceled.
if disable fails to cancel opening, or server opens it first in a race,
region will be opened, and master will issue close immediately after state
update. Given that region is not closed, disable is not complete.
if opening (or closing) times out, master will fence off RS and mark region
as closed. If there was some way of fencing region separately (ZK lease?) it
would be possible to use that.
In any case, until client checks table state before every write, there's no
easy way to prevent writes on disabling table. Writes on disabled table will
not be possible.
On ensuring there's no double assignment due to RS hanging:
The intent is to fence the WAL for region server, the way we do now. One could
also use other mechanism.
Perhaps I could specify it more clearly; I think the problem of making sure RS
is dead is nearly orthogonal.
In my model, due to how opening region is committed to opened, we can only be
unsure when the region is in Opened state (or similar states such as Splitting
which are not present in my current version, but will be added).
In that case, in absence of normal transition, we cannot do literally anything
with the region unless we are sufficiently sure that RS is sufficiently dead
(e.g. cannot write).
So, while we ensure that RS is dead we don't reassign.
My document implies (but doesn't elaborate, I'll fix that) that master does
direct Opened->Closed direct transition only when that is true.
A state called "MaybeOpened" could be added. Let me add it...
was (Author: sershe):
Answers lifted from email also (some fixes + one answer was modified due to
clarification here :)).
bq. What is a failure and how do you react to failures? I think the master5
design needs to spend more effort to considering failure and recovery cases. I
claim there are 4 types of responses from a networked IO operation - two
states we normally deal with ack successful, ack failed (nack) and unknown due
to timeout that succeeded (timeout success) and unknown due to timeout that
failed (timeout failed). We have historically missed the last two cases and
they aren't considered in the master5 design.
There are a few considerations. Let me examine if there are other cases than
these.
I am assuming the collocated table, which should reduce such cases for state
(probably, if collocated table cannot be written reliably, master must
stop-the-world and fail over).
When RS contacts master to do state update, it errs on the side of caution - no
state update, no open region (or split).
Thus, except for the case of multiple masters running, we can always assume RS
didn't online the region if we don't know about it.
Then, for messages to RS, see "Note on messages"; they are idempotent so they
can always be resent.
bq. 1) State update coordination. What is a "state updates from the outside"
Do RS's initiate splitting on their own? Maybe a picture would help so we can
figure out if it is similar or different from hbck-master's?
Yes, these are RS messages. They are mentioned in some operation descriptions
in part 2 - opening->opened, closing->closed; splitting, etc.
bq. 2) Single point of truth. hbck-master tries to define what single point
of truth means by defining intended, current, and actual state data with
durability properties on each kind. What do clients look at who modifies what?
Sorry, don't understand the question. I mean single source of truth mainly
about what is going on with the region; it is described in design
considerations.
I like the idea of "intended state", however without more detailed reading I am
not sure how it works for multiple ops e.g. master recovering the region while
the user intends to split it, so the split should be executed after it's opened.
bq. 3) Table record: "if regions is out of date, it should be closed and
reopened". It is not clear in master5 how regionservers find out that they are
out of date. Moreover, how do clients talking to those RS's with stale versions
know they are going to the correct RS especially in the face of RS failures due
to timeout?
On alter (and startup if failed), master tries to reopen all regions that are
out of date.
Regions that are not opened with either pick up the new version when they are
opened, or (e.g. if they are now Opening with old version) master discovers
they are out of date when they are transitioned to Opened by RS, and reopens
them again.
As for any case of alter on enabled table, there are no guarantees for clients.
To provide these w/o disable/enable (or logical equivalent of coordinating all
close-s and open-s), one would need some form of version-time-travel, or
waiting for versions, or both.
bq. 4) region record: transition states. This is really similar to
hbck-masters current state and intended state. Shouldn't be defined as part of
the region record?
I mention somewhere that could be done. One thing is that if several paths are
possible between states, it's useful to know which is taken.
But do note that I store user intent separately from what is currently going
on, so they are not exactly similar as far as I see.
bq. 5) Note on user operations: the forgetting thing is scary to me -- in your
move split example, what happens if an RS reads state that is forgotten?
I think my description of this might be too vague. State is not forgotten;
previous intent is forgotten. I.e. if user does several operations in order
that conflict (e.g. split and then merge), the first one will be canceled
(safely :)).
Also, RS does not read state as a guideline to what needs to be done.
bq. 6) table state machine. how do we guarantee clients are writing from the
correct version in the in failures?
The intent is to fence the WAL for region server, the way we do now. One could
also use other mechanism.
Perhaps I could specify it more clearly; I think the problem of making sure RS
is dead is nearly orthogonal.
In my model, due to how opening region is committed to opened, we can only be
unsure when the region is in Opened state (or similar states such as Splitting
which are not present in my current version, but will be added).
In that case, in absence of normal transition, we cannot do literally anything
with the region unless we are sufficiently sure that RS is sufficiently dead
(e.g. cannot write).
So, while we ensure that RS is dead we don't reassign.
My document implies (but doesn't elaborate, I'll fix that) that master does
direct Opened->Closed direct transition only when that is true.
A state called "MaybeOpened" could be added. Let me add it...
bq. 7) region state machine. Earlier draft hand splitting and merge cases.
Are they elided in master5 or are not present any more. How would this get
extended handle jeffrey's distributed log replay/fast write recovery feature?
As I mention somewhere these could be separate states. I was kind of afraid of
blowing up state machine too much, so I noticed that for split/merge you anyway
store siblings/children, so you can recognize them and for most purposes
different split-merge states are the same as Opened and Closed.
I will add those back, it would make sense.
bq. 8) logical interactions: sounds like master5 allows concurrent region and
table operations. hbck-master (though not fully documented) only allows
certain region transitions when the table is enabled or if the table is
disabled. Are we sure we don't get into race conditions? What happens if
disable gets issued -- its possible for someone to reopens the region and for
old clients to continue writing to it even though it is closed?
Yes, parallelism is intended. You can never be sure you have no races but we
should aim for it :)
master5 is missing disabled/enabled check, that is a mistake.
Part1 operation interactions already cover it:
table disable doesn't ack until all regions are closed (master5 is wrong
:().
region opening cannot start if table is already disabling or disabled.
if region is already opening when disable is issued, opening will be
opportunistically canceled.
if disable fails to cancel opening, or server opens it first in a race,
region will be opened, and master will issue close immediately after state
update. Given that region is not closed, disable is not complete.
if opening (or closing) times out, master will fence off RS and mark region
as closed. If there was some way of fencing region separately (ZK lease?) it
would be possible to use that.
In any case, until client checks table state before every write, there's no
easy way to prevent writes on disabling table. Writes on disabled table will
not be possible.
> Generic framework for Master-coordinated tasks
> ----------------------------------------------
>
> Key: HBASE-5487
> URL: https://issues.apache.org/jira/browse/HBASE-5487
> Project: HBase
> Issue Type: New Feature
> Components: master, regionserver, Zookeeper
> Affects Versions: 0.94.0
> Reporter: Mubarak Seyed
> Assignee: Sergey Shelukhin
> Priority: Critical
> Attachments: Entity management in Master - part 1.pdf,
> hbckMasterV2-long.pdf, Region management in Master5.docx, Region management
> in Master.pdf
>
>
> Need a framework to execute master-coordinated tasks in a fault-tolerant
> manner.
> Master-coordinated tasks such as online-scheme change and delete-range
> (deleting region(s) based on start/end key) can make use of this framework.
> The advantages of framework are
> 1. Eliminate repeated code in Master, ZooKeeper tracker and Region-server for
> master-coordinated tasks
> 2. Ability to abstract the common functions across Master -> ZK and RS -> ZK
> 3. Easy to plugin new master-coordinated tasks without adding code to core
> components
--
This message was sent by Atlassian JIRA
(v6.1#6144)