[jira] [Comment Edited] (HBASE-5487) Generic framework for Master-coordinated tasks

Sergey Shelukhin (JIRA) Fri, 18 Oct 2013 15:51:43 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799619#comment-13799619
 ]


Sergey Shelukhin edited comment on HBASE-5487 at 10/18/13 10:50 PM:
--------------------------------------------------------------------

Answers lifted from email also (some fixes + one answer was modified due to 
clarification here :)).

bq.  What is a failure and how do you react to failures? I think the master5 
design needs to spend more effort  to considering failure and recovery cases. I 
claim there are 4 types of responses from a networked IO operation -  two 
states we normally deal with ack successful, ack failed (nack) and unknown due 
to timeout  that succeeded (timeout success) and unknown due to timeout that 
failed (timeout failed). We have historically missed the last two cases and 
they aren't considered in the master5 design. 

There are a few considerations. Let me examine if there are other cases than 
these.
I am assuming the collocated table, which should reduce such cases for state 
(probably, if collocated table cannot be written reliably, master must 
stop-the-world and fail over).
When RS contacts master to do state update, it errs on the side of caution - no 
state update, no open region (or split).
Thus, except for the case of multiple masters running, we can always assume RS 
didn't online the region if we don't know about it.
Then, for messages to RS, see "Note on messages"; they are idempotent so they 
can always be resent.

bq.  1) State update coordination.  What is a "state updates from the outside"  
Do RS's initiate splitting on their own?  Maybe a picture would help so we can 
figure out if it is similar or different from hbck-master's?

Yes, these are RS messages. They are mentioned in some operation descriptions 
in part 2 - opening->opened, closing->closed; splitting, etc.

bq.  2) Single point of truth.  hbck-master tries to define what single point 
of truth means by defining intended, current, and actual state data with 
durability properties on each kind. What do clients look at who modifies what? 

Sorry, don't understand the question. I mean single source of truth mainly 
about what is going on with the region; it is described in design 
considerations.
I like the idea of "intended state", however without more detailed reading I am 
not sure how it works for multiple ops e.g. master recovering the region while 
the user intends to split it, so the split should be executed after it's opened.

bq.  3) Table record: "if regions is out of date, it should be closed and 
reopened". It is not clear in master5 how regionservers find out that they are 
out of date. Moreover, how do clients talking to those RS's with stale versions 
know they are going to the correct RS especially in the face of RS failures due 
to timeout?

On alter (and startup if failed), master tries to reopen all regions that are 
out of date.
Regions that are not opened with either pick up the new version when they are 
opened, or (e.g. if they are now Opening with old version) master discovers 
they are out of date when they are transitioned to Opened by RS, and reopens 
them again.

As for any case of alter on enabled table, there are no guarantees for clients.
To provide these w/o disable/enable (or logical equivalent of coordinating all 
close-s and open-s), one would need some form of version-time-travel, or 
waiting for versions, or both.

bq.  4) region record: transition states.  This is really similar to 
hbck-masters current state and intended state.  Shouldn't be defined as part of 
the region record?

I mention somewhere that could be done. One thing is that if several paths are 
possible between states, it's useful to know which is taken.
But do note that I store user intent separately from what is currently going 
on, so they are not exactly similar as far as I see.

bq.  5) Note on user operations: the forgetting thing is scary to me -- in your 
move split example, what happens if an RS reads state that is forgotten?  

I think my description of this might be too vague. State is not forgotten; 
previous intent is forgotten. I.e. if user does several operations in order 
that conflict (e.g. split and then merge), the first one will be canceled 
(safely :)).
Also, RS does not read state as a guideline to what needs to be done.

bq.  6) table state machine. how do we guarantee clients are writing from the 
correct version in the in failures?
Can you please elaborate?

bq.  7) region state machine.  Earlier draft hand splitting and merge cases.  
Are they elided in master5 or are not present any more. How would this get 
extended handle jeffrey's distributed log replay/fast write recovery feature? 

As I mention somewhere these could be separate states. I was kind of afraid of 
blowing up state machine too much, so I noticed that for split/merge you anyway 
store siblings/children, so you can recognize them and for most purposes 
different split-merge states are the same as Opened and Closed.

I will add those back, it would make sense.

bq.  8) logical interactions:  sounds like master5 allows concurrent region and 
table operations.  hbck-master (though not fully documented) only allows 
certain region transitions when the table is enabled or if the table is 
disabled.  Are we sure we don't get into race conditions?  What happens if 
disable gets issued -- its possible for someone to reopens the region and for 
old clients to continue writing to it even though it is closed?

Yes, parallelism is intended. You can never be sure you have no races but we 
should aim for it :)

master5 is missing disabled/enabled check, that is a mistake.

Part1 operation interactions already cover it:

    table disable doesn't ack until all regions are closed (master5 is wrong 
:().
    region opening cannot start if table is already disabling or disabled.
    if region is already opening when disable is issued, opening will be 
opportunistically canceled.
    if disable fails to cancel opening, or server opens it first in a race, 
region will be opened, and master will issue close immediately after state 
update. Given that region is not closed, disable is not complete.
    if opening (or closing) times out, master will fence off RS and mark region 
as closed. If there was some way of fencing region separately (ZK lease?) it 
would be possible to use that.

In any case, until client checks table state before every write, there's no 
easy way to prevent writes on disabling table. Writes on disabled table will 
not be possible.


On ensuring there's no double assignment due to RS hanging:
The intent is to fence the WAL for region server, the way we do now. One could 
also use other mechanism.
Perhaps I could specify it more clearly; I think the problem of making sure RS 
is dead is nearly orthogonal.
In my model, due to how opening region is committed to opened, we can only be 
unsure when the region is in Opened state (or similar states such as Splitting 
which are not present in my current version, but will be added).
In that case, in absence of normal transition, we cannot do literally anything 
with the region unless we are sufficiently sure that RS is sufficiently dead 
(e.g. cannot write).
So, while we ensure that RS is dead we don't reassign.
My document implies (but doesn't elaborate, I'll fix that) that master does 
direct Opened->Closed direct transition only when that is true.
A state called "MaybeOpened" could be added. Let me add it...


was (Author: sershe):
Answers lifted from email also (some fixes + one answer was modified due to 
clarification here :)).

bq.  What is a failure and how do you react to failures? I think the master5 
design needs to spend more effort  to considering failure and recovery cases. I 
claim there are 4 types of responses from a networked IO operation -  two 
states we normally deal with ack successful, ack failed (nack) and unknown due 
to timeout  that succeeded (timeout success) and unknown due to timeout that 
failed (timeout failed). We have historically missed the last two cases and 
they aren't considered in the master5 design. 

There are a few considerations. Let me examine if there are other cases than 
these.
I am assuming the collocated table, which should reduce such cases for state 
(probably, if collocated table cannot be written reliably, master must 
stop-the-world and fail over).
When RS contacts master to do state update, it errs on the side of caution - no 
state update, no open region (or split).
Thus, except for the case of multiple masters running, we can always assume RS 
didn't online the region if we don't know about it.
Then, for messages to RS, see "Note on messages"; they are idempotent so they 
can always be resent.

bq.  1) State update coordination.  What is a "state updates from the outside"  
Do RS's initiate splitting on their own?  Maybe a picture would help so we can 
figure out if it is similar or different from hbck-master's?

Yes, these are RS messages. They are mentioned in some operation descriptions 
in part 2 - opening->opened, closing->closed; splitting, etc.

bq.  2) Single point of truth.  hbck-master tries to define what single point 
of truth means by defining intended, current, and actual state data with 
durability properties on each kind. What do clients look at who modifies what? 

Sorry, don't understand the question. I mean single source of truth mainly 
about what is going on with the region; it is described in design 
considerations.
I like the idea of "intended state", however without more detailed reading I am 
not sure how it works for multiple ops e.g. master recovering the region while 
the user intends to split it, so the split should be executed after it's opened.

bq.  3) Table record: "if regions is out of date, it should be closed and 
reopened". It is not clear in master5 how regionservers find out that they are 
out of date. Moreover, how do clients talking to those RS's with stale versions 
know they are going to the correct RS especially in the face of RS failures due 
to timeout?

On alter (and startup if failed), master tries to reopen all regions that are 
out of date.
Regions that are not opened with either pick up the new version when they are 
opened, or (e.g. if they are now Opening with old version) master discovers 
they are out of date when they are transitioned to Opened by RS, and reopens 
them again.

As for any case of alter on enabled table, there are no guarantees for clients.
To provide these w/o disable/enable (or logical equivalent of coordinating all 
close-s and open-s), one would need some form of version-time-travel, or 
waiting for versions, or both.

bq.  4) region record: transition states.  This is really similar to 
hbck-masters current state and intended state.  Shouldn't be defined as part of 
the region record?

I mention somewhere that could be done. One thing is that if several paths are 
possible between states, it's useful to know which is taken.
But do note that I store user intent separately from what is currently going 
on, so they are not exactly similar as far as I see.

bq.  5) Note on user operations: the forgetting thing is scary to me -- in your 
move split example, what happens if an RS reads state that is forgotten?  

I think my description of this might be too vague. State is not forgotten; 
previous intent is forgotten. I.e. if user does several operations in order 
that conflict (e.g. split and then merge), the first one will be canceled 
(safely :)).
Also, RS does not read state as a guideline to what needs to be done.

bq.  6) table state machine. how do we guarantee clients are writing from the 
correct version in the in failures?

The intent is to fence the WAL for region server, the way we do now. One could 
also use other mechanism.
Perhaps I could specify it more clearly; I think the problem of making sure RS 
is dead is nearly orthogonal.
In my model, due to how opening region is committed to opened, we can only be 
unsure when the region is in Opened state (or similar states such as Splitting 
which are not present in my current version, but will be added).
In that case, in absence of normal transition, we cannot do literally anything 
with the region unless we are sufficiently sure that RS is sufficiently dead 
(e.g. cannot write).
So, while we ensure that RS is dead we don't reassign.
My document implies (but doesn't elaborate, I'll fix that) that master does 
direct Opened->Closed direct transition only when that is true.
A state called "MaybeOpened" could be added. Let me add it...

bq.  7) region state machine.  Earlier draft hand splitting and merge cases.  
Are they elided in master5 or are not present any more. How would this get 
extended handle jeffrey's distributed log replay/fast write recovery feature? 

As I mention somewhere these could be separate states. I was kind of afraid of 
blowing up state machine too much, so I noticed that for split/merge you anyway 
store siblings/children, so you can recognize them and for most purposes 
different split-merge states are the same as Opened and Closed.

I will add those back, it would make sense.

bq.  8) logical interactions:  sounds like master5 allows concurrent region and 
table operations.  hbck-master (though not fully documented) only allows 
certain region transitions when the table is enabled or if the table is 
disabled.  Are we sure we don't get into race conditions?  What happens if 
disable gets issued -- its possible for someone to reopens the region and for 
old clients to continue writing to it even though it is closed?

Yes, parallelism is intended. You can never be sure you have no races but we 
should aim for it :)

master5 is missing disabled/enabled check, that is a mistake.

Part1 operation interactions already cover it:

    table disable doesn't ack until all regions are closed (master5 is wrong 
:().
    region opening cannot start if table is already disabling or disabled.
    if region is already opening when disable is issued, opening will be 
opportunistically canceled.
    if disable fails to cancel opening, or server opens it first in a race, 
region will be opened, and master will issue close immediately after state 
update. Given that region is not closed, disable is not complete.
    if opening (or closing) times out, master will fence off RS and mark region 
as closed. If there was some way of fencing region separately (ZK lease?) it 
would be possible to use that.

In any case, until client checks table state before every write, there's no 
easy way to prevent writes on disabling table. Writes on disabled table will 
not be possible.

> Generic framework for Master-coordinated tasks
> ----------------------------------------------
>
>                 Key: HBASE-5487
>                 URL: https://issues.apache.org/jira/browse/HBASE-5487
>             Project: HBase
>          Issue Type: New Feature
>          Components: master, regionserver, Zookeeper
>    Affects Versions: 0.94.0
>            Reporter: Mubarak Seyed
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>         Attachments: Entity management in Master - part 1.pdf, 
> hbckMasterV2-long.pdf, Region management in Master5.docx, Region management 
> in Master.pdf
>
>
> Need a framework to execute master-coordinated tasks in a fault-tolerant 
> manner. 
> Master-coordinated tasks such as online-scheme change and delete-range 
> (deleting region(s) based on start/end key) can make use of this framework.
> The advantages of framework are
> 1. Eliminate repeated code in Master, ZooKeeper tracker and Region-server for 
> master-coordinated tasks
> 2. Ability to abstract the common functions across Master -> ZK and RS -> ZK
> 3. Easy to plugin new master-coordinated tasks without adding code to core 
> components



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Comment Edited] (HBASE-5487) Generic framework for Master-coordinated tasks

Reply via email to