[
https://issues.apache.org/jira/browse/HBASE-12439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207164#comment-14207164
]
Matteo Bertozzi commented on HBASE-12439:
-----------------------------------------
{quote}This will be a huge feature for MTTR and online reliability – why the
Minor label?{quote}
Because it will probably land in a 1.x or 2.x release, since it requires a
number of core changes (e.g. handlers, maybe assignment, ...).
{quote}client submits a "procedure" that it's interested in observing through
its execution progress{quote}
Correct, e.g. when you call create table you get a procedure id that you can
wait on, or use to check the progress state if the procedure exposes one.
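A minimal sketch of the "submit, get an id, wait on it" flow described above.
All names here (ProcedureTracker, submit, waitFor, status) are hypothetical and
are not taken from the HBase code base; the operation is a plain Runnable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only, not the Procedure v2 API.
class ProcedureTracker {
    // This sketch has no failure state: a failed operation also ends up FINISHED.
    enum Status { RUNNING, FINISHED }

    private final AtomicLong nextId = new AtomicLong(1);
    private final Map<Long, Status> statuses = new ConcurrentHashMap<>();

    // Starts the operation in the background and returns its id immediately.
    long submit(Runnable operation) {
        long procId = nextId.getAndIncrement();
        statuses.put(procId, Status.RUNNING);
        new Thread(() -> {
            try {
                operation.run();
            } finally {
                markFinished(procId);
            }
        }).start();
        return procId;
    }

    // Non-blocking check of the current status (null for an unknown id).
    Status status(long procId) {
        return statuses.get(procId);
    }

    // Blocks until the given procedure is finished.
    // Waits forever if the id was never submitted.
    synchronized void waitFor(long procId) {
        try {
            while (statuses.get(procId) != Status.FINISHED) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    private synchronized void markFinished(long procId) {
        statuses.put(procId, Status.FINISHED);
        notifyAll();
    }
}
```

A caller would do something like "long id = tracker.submit(createTable);" and
later either poll tracker.status(id) or block in tracker.waitFor(id).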
{quote}a procedure is defined as a DAG of sub-procedures that are required to
complete procedure execution{quote}
correct
{quote}multiple sub-procedures can be executed in parallel{quote}
Correct. The example I have in mind here is Snapshot or EnableTable, where the
"Enable Procedure" spawns a sub-procedure for assigning each region, and these
can be executed in parallel.
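To illustrate the EnableTable case, here is a rough sketch of a parent that
spawns one child task per region and waits for all of them. The class and
method names are hypothetical, assignRegion() is a stand-in that only returns a
string, and CompletableFuture is used here just as a convenient way to run the
children in parallel; it says nothing about how the real framework schedules
sub-procedures.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical illustration only, not the Procedure v2 API.
class EnableTableSketch {
    // Stand-in for the "assign one region" sub-procedure.
    static String assignRegion(String region) {
        return region + ":assigned";
    }

    // Parent procedure: spawn one child per region, then wait for all of them.
    static List<String> enableTable(List<String> regions) {
        // Start every child first so they can run concurrently.
        List<CompletableFuture<String>> children = regions.stream()
            .map(region -> CompletableFuture.supplyAsync(() -> assignRegion(region)))
            .collect(Collectors.toList());
        // join() blocks until each child completes; results keep the input order.
        return children.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
    }
}
```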
{quote}a sub-procedure can define an action that must be taken on multiple
hosts{quote}
Not sure I understand this one. A sub-procedure is an operation: it can be a
simple "write to meta" or it can be "send a snapshot request to the RS". If you
are thinking about things like Snapshots or ACL cache updates, you basically
have two components, a coordinator on the Master side and an executor on the
RS, so the procedure on the master looks like "send the operation to the RSs,
wait for the ACKs, do the finalization".
{quote}DAG execution progress is tracked through a storage system{quote}
Correct. Every time a procedure step is executed we write the state out, so we
can resume from that point. If we are in the middle of a create table and the
master goes down, the backup master can start from step N of the create table
that was in progress.
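A minimal sketch of the "persist after every step, resume from step N" idea.
Everything here is an assumption made for illustration: the step names are
invented, and a plain Map shared between two objects stands in for the durable
store that the active and backup master would both see.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only, not the Procedure v2 API.
class ResumableCreateTable {
    // Invented step names; the real create table steps may differ.
    static final String[] STEPS = {
        "pre-create", "write layout", "add to meta", "assign regions"
    };

    // Stand-in for the durable store: procId -> index of the next step to run.
    final Map<Long, Integer> store;
    // Steps executed by this instance, kept only so the behavior is observable.
    final List<String> executed = new ArrayList<>();

    ResumableCreateTable(Map<Long, Integer> store) {
        this.store = store;
    }

    // Runs at most maxSteps steps, writing the progress out after each one.
    void run(long procId, int maxSteps) {
        int next = store.getOrDefault(procId, 0);
        for (int done = 0; next < STEPS.length && done < maxSteps; done++) {
            executed.add(STEPS[next]);
            next++;
            store.put(procId, next); // persist before moving to the next step
        }
    }
}
```

If one instance runs two steps and "dies", a second instance built on the same
store picks up at "add to meta" instead of starting over.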
{quote}procedure execution can be halted and reverted at any time{quote}
Yes, you send an abort() to the procedure and, if it was already started, it is
rolled back.
{quote}completed DAG sub-procedures must be able to roll-back in the event of
procedure revert{quote}
Yes, each step should implement a rollback(), which is called when one of the
steps fails or the user aborts.
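A rough sketch of per-step rollback. The Step interface and RollbackRunner are
hypothetical names. As an assumption of this sketch, only the steps that
already completed are rolled back, in reverse order, and the step that failed
is not; user-initiated abort is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only, not the Procedure v2 API.
interface Step {
    void execute() throws Exception;
    void rollback();
}

class RollbackRunner {
    // Returns true if every step completed, false if the run was rolled back.
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                completed.push(step); // most recent step ends up on top
            } catch (Exception e) {
                // Undo the completed steps, newest first.
                while (!completed.isEmpty()) {
                    completed.pop().rollback();
                }
                return false;
            }
        }
        return true;
    }
}
```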
{quote}procedure execution is tied to transitions through a persisted state
machine{quote}
correct.
{quote}all procedures have the same set of states through which they can
transition{quote}
Not sure what you mean, but the framework has its own fixed set of states,
"runnable/waiting/rollingback/failed/completed", and the user code that
implements the procedure doesn't care about them. It only cares about
"step 1 -> step 2 -> step 3".
{quote}Why implement a separate store? Can we not use a system table for the
procedure state store?{quote}
The store is just an interface with insert/remove, so you can use whatever you
want. The main problem with using a table is that you end up with a
chicken-and-egg problem: how can I create a table if I need a table to write
the procedure state?
I could also argue that if you use just a WAL you can drop the WAL once there
are no procedures running, so you avoid the compaction overhead and so on, but
that is just an optimization. The main problem is the startup loop.
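A minimal sketch of the store as an insert/remove interface, with a log-backed
implementation that drops its log once no procedure is running. The names
ProcedureStore and LogBackedStore, the method signatures, and the String state
are all assumptions made for illustration; an in-memory list stands in for the
WAL.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only, not the Procedure v2 API.
interface ProcedureStore {
    void insert(long procId, String state);
    void remove(long procId);
}

class LogBackedStore implements ProcedureStore {
    // Stand-in for the WAL: an append-only list of entries.
    final List<String> log = new ArrayList<>();
    // Ids of procedures that were inserted and not yet removed.
    private final Set<Long> active = new HashSet<>();

    @Override
    public void insert(long procId, String state) {
        log.add("INSERT " + procId + " " + state);
        active.add(procId);
    }

    @Override
    public void remove(long procId) {
        log.add("REMOVE " + procId);
        active.remove(procId);
        // With nothing left to recover, the whole log can be dropped.
        if (active.isEmpty()) {
            log.clear();
        }
    }
}
```

Because the store never needs a table, it is usable before any table exists,
which is the point of avoiding the startup loop.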
> Procedure V2
> ------------
>
> Key: HBASE-12439
> URL: https://issues.apache.org/jira/browse/HBASE-12439
> Project: HBase
> Issue Type: New Feature
> Components: master
> Affects Versions: 2.0.0
> Reporter: Matteo Bertozzi
> Assignee: Matteo Bertozzi
> Priority: Minor
> Attachments: ProcedureV2.pdf
>
>
> Procedure v2 (aka Notification Bus) aims to provide a unified way to build:
> * multi-steps procedure with a rollback/rollforward ability in case of
> failure (e.g. create/delete table)
> ** HBASE-12070
> * notifications across multiple machines (e.g. ACLs/Labels/Quotas cache
> updates)
> ** Make sure that every machine has the grant/revoke/label
> ** Enforce "space limit" quota across the namespace
> ** HBASE-10295 eliminate permanent replication zk node
> * procedures across multiple machines (e.g. Snapshots)
> * coordinated long-running procedures (e.g. compactions, splits, ...)
> * Synchronous calls, with the ability to see the state/result in case of
> failure.
> ** HBASE-11608 sync split
> still work in progress/initial prototype: https://reviews.apache.org/r/27703/