[ 
https://issues.apache.org/jira/browse/HBASE-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509463#comment-13509463
 ] 

Jonathan Hsieh commented on HBASE-7212:
---------------------------------------

I need to take a look at the source implementation of the curator double 
barrier and examples of its use to do a better job of comparing. Based on the 
api and the zk recipes, I'm going to make some assumptions here.

As another analogy, it seems that our procedure mechanism is similar to a 
monitor (synchronized in java) that guarantees enter/acquire and leave/release 
of the barrier parts, while the curator one is lower level and leaves it to the 
implementer to enforce that invariant.
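
As a rough illustration of the monitor analogy, the sketch below shows a 
runner that owns the enter/leave pairing so the implementer cannot forget the 
release step. It is not the code in the patch: BarrierSteps, 
MonitorStyleProcedure and RecordingSteps are hypothetical names, and the 
method signatures are assumed from the issue description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical member-side hooks; the signatures are assumptions. */
interface BarrierSteps {
    void acquireBarrier() throws Exception;
    void insideBarrier() throws Exception;
    void releaseBarrier();
}

/** Monitor-style runner: once run() is called, release always executes. */
final class MonitorStyleProcedure {
    /** Returns true if both acquire and inside completed without error. */
    static boolean run(BarrierSteps steps) {
        try {
            steps.acquireBarrier();
            steps.insideBarrier();
            return true;
        } catch (Exception e) {
            return false; // a real mechanism would propagate the cause
        } finally {
            steps.releaseBarrier(); // the invariant lives here, not in user code
        }
    }
}

/** Test double that records the order of calls. */
final class RecordingSteps implements BarrierSteps {
    final List<String> calls = new ArrayList<>();
    private final boolean failInside;

    RecordingSteps(boolean failInside) {
        this.failInside = failInside;
    }

    public void acquireBarrier() {
        calls.add("acquire");
    }

    public void insideBarrier() throws Exception {
        calls.add("inside");
        if (failInside) {
            throw new Exception("simulated failure inside the barrier");
        }
    }

    public void releaseBarrier() {
        calls.add("release");
    }
}
```

With a lower-level barrier API, the try/finally above would have to be 
written by each caller instead of once in the runner.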

bq. Rather than 'abort', you could just timeout? That might be simpler still? 
i.e. your "Need to be able to force failure after a specified timeout elapses"

So in this patch, the time-based abort trigger and a potential user-induced 
cancellation use the same mechanism to notify all members (and the 
coordinator) that the procedure has aborted.
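
A minimal sketch of that idea, assuming a single shared abort signal that 
both a timer and a user cancel feed into. AbortSignal and AbortTriggers are 
hypothetical names, not classes from the patch, and the notification to 
remote members is left out.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical abort latch that remembers only the first abort cause. */
final class AbortSignal {
    private final AtomicReference<String> cause = new AtomicReference<>();

    /** Returns true only for the caller whose reason was recorded. */
    boolean abort(String reason) {
        return cause.compareAndSet(null, reason);
    }

    boolean isAborted() {
        return cause.get() != null;
    }

    String cause() {
        return cause.get();
    }
}

/** Two triggers, one mechanism: both end up calling AbortSignal.abort. */
final class AbortTriggers {
    /** Time-based trigger: aborts the signal once the timeout elapses. */
    static ScheduledExecutorService startTimeout(AbortSignal signal, long timeoutMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.schedule(() -> { signal.abort("timeout"); }, timeoutMillis, TimeUnit.MILLISECONDS);
        return timer;
    }

    /** User-induced trigger: same call, different reason. */
    static void cancel(AbortSignal signal) {
        signal.abort("cancelled by user");
    }

    /** Lets an already-scheduled timeout run, then stops the timer thread. */
    static void drain(ScheduledExecutorService timer) {
        timer.shutdown(); // delayed tasks scheduled earlier still run by default
        try {
            timer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because only the first cause is kept, a cancel that arrives after the timeout 
has fired (or the reverse) does not overwrite the original reason.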

I'm speculating, but I think one assumption this mechanism makes that the 
double barrier's does not is that the actions on the members may be slow 
(one implementation waits for a memstore flush per region) and may need to be 
interrupted before completion.  The curator double barrier api doesn't have 
such a mechanism, so we may have to wait for all operations to complete before 
we can abort them.
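
To show what interrupting a slow member action could look like, here is a 
small sketch of cooperative checking between units of work. SlowMemberTask 
and flushRegions are hypothetical names; the per-region flush is passed in as 
a plain Runnable rather than any real HBase call.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical member task that checks for an abort between regions. */
final class SlowMemberTask {
    /**
     * Runs flushOne once per region, stopping early if aborted is set.
     * Returns the number of regions actually flushed.
     */
    static int flushRegions(int regionCount, AtomicBoolean aborted, Runnable flushOne) {
        int flushed = 0;
        for (int i = 0; i < regionCount; i++) {
            if (aborted.get()) {
                break; // give up before starting the next slow flush
            }
            flushOne.run();
            flushed++;
        }
        return flushed;
    }
}
```

The check only happens between flushes, so a flush that is already running 
still finishes; finer-grained interruption would need support inside the 
flush itself.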

bq. double-barrier does not seem to be enough though. There needs to be a means 
of telling cluster members to go for a particular snapshot barrier. To this 
end, I suppose all members need to be watching a snapshot dir and when a new 
snapshot appears, all try to 'enter' its barrier?

I believe that would be the case if we used curator.  I don't see a reason we 
couldn't use it -- and the factoring out of the *Comms/*Rpcs would potentially 
allow us to move to it in a future rev.

bq. Is it true that you do not want members to start 'snapshotting' until ALL 
participants have 'entered' the barrier? Does it matter if they start doing 
their work as soon as they 'enter' the barrier (using curator/zk recipe terms)? 
Reading on, it seems like it's fine if members just go about their merry 
way....working on their part of snapshot. If not all members complete, the 
coordinator will clean up the incomplete.

At the end of the day, the full barrier is only required for the snapshot that 
completely blocks all writes to get a truly consistent snapshot.  The weaker 
snapshots (either the timestamp-based or log-roll-based) won't give those 
guarantees and don't actually need the full barrier.  For the first cut, 
however, I'm probably going to use it since it handles the error propagation 
and cross-process cancellation.

bq. What do you think of the terms in the zk recipe: i.e. rather than 'reach' 
a barrier, 'enter' it?

I'm fine with it -- I'll change the terms acquire -> enter, reached -> leave in 
the next rev I post.  (in the v3 version I still need to clean up the 
nomenclature in the tests).

I'll do another rev of the docs to make it consistent with the changes being 
made.



                
> Globally Barriered Procedure mechanism
> --------------------------------------
>
>                 Key: HBASE-7212
>                 URL: https://issues.apache.org/jira/browse/HBASE-7212
>             Project: HBase
>          Issue Type: Sub-task
>          Components: snapshots
>    Affects Versions: hbase-6055
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>             Fix For: hbase-6055
>
>         Attachments: 121127-global-barrier-proc.pdf, hbase-7212.patch, 
> pre-hbase-7212.patch
>
>
> This is a simplified version of what was proposed in HBASE-6573.  Instead of 
> claiming to be a 2pc or 3pc implementation (which implies logging at each 
> actor, and recovery operations) this just provides a best effort global 
> barrier mechanism called a Procedure.  
> Users need only to implement methods to acquireBarrier, to act when 
> insideBarrier, and to releaseBarrier that use the ExternalException 
> cooperative error checking mechanism.
> Globally consistent snapshots require the ability to quiesce writes to a set 
> of region servers before the snapshot operation is executed.  Also if any 
> node fails, it needs to be able to notify them so that they abort.
> The first cut of other online snapshots doesn't need the full barrier but may 
> still use this for its error propagation mechanisms.
> This version removes the extra layer incurred in the previous implementation 
> due to the use of generics, separates the coordinator and members, and 
> reduces the amount of inheritance used in favor of composition.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
