[ https://issues.apache.org/jira/browse/HBASE-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-21083:
----------------------------
    Description: 
Discussed offline with [~stack] and [~Apache9]. We all agreed that we need to 
introduce a mechanism to 'force complete' a stuck procedure, so that AMv2 can 
continue running.
We still have some undiscovered bugs hiding in our AMv2 and procedureV2 system, 
so we need a way to intervene in stuck procedures before HBCK2 can work. 
This is crucial for a production-ready system.

For now, we have few ways to intervene in running procedures. Aborting 
them is not a good choice, since some procedures are not abortable, and some 
procedures may have overridden the abort() method so that it ignores the abort 
request.

So, here, I will introduce a mechanism to bypass the execution of a stuck 
procedure.
Basically, I added a field called 'bypass' to the Procedure class. If we set 
this field to true, all the logic in execute/rollback will be skipped, letting 
this procedure and its ancestors complete normally and finally release their 
lock resources.
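
To make the idea concrete, here is a minimal sketch of how such a flag could 
short-circuit execution. It is a toy model only: ToyProcedure, runOnce, 
bypassWithAncestors and demo are hypothetical names made up for illustration 
and are not the classes or APIs in the patch. It also assumes that bypassing a 
procedure marks all of its ancestors as well, which is just one reading of the 
description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model. None of these classes are the real HBase
// Procedure or ProcedureExecutor; they only illustrate the bypass idea.
final class BypassSketch {

    /** Toy procedure: may have a parent, may be stuck, carries a bypass flag. */
    static final class ToyProcedure {
        final String name;
        final ToyProcedure parent;        // null for a root procedure
        final boolean stuck;              // simulates a body that never succeeds
        private volatile boolean bypass;  // the flag described in the text
        boolean lockHeld;
        boolean finished;

        ToyProcedure(String name, ToyProcedure parent, boolean stuck) {
            this.name = name;
            this.parent = parent;
            this.stuck = stuck;
        }

        boolean isBypass() { return bypass; }
        void setBypass(boolean value) { this.bypass = value; }

        /** Real work; returns false when the step has to be retried. */
        boolean execute(List<String> log) {
            log.add(name + ": execute body ran");
            return !stuck;
        }
    }

    /** Assumption: bypassing marks the procedure and every ancestor. */
    static void bypassWithAncestors(ToyProcedure proc) {
        for (ToyProcedure p = proc; p != null; p = p.parent) {
            p.setBypass(true);
        }
    }

    /** One executor pass: skip the body if bypassed, release the lock when done. */
    static boolean runOnce(ToyProcedure proc, List<String> log) {
        proc.lockHeld = true;
        boolean done;
        if (proc.isBypass()) {
            log.add(proc.name + ": bypassed, body skipped");
            done = true;
        } else {
            done = proc.execute(log);
        }
        if (done) {
            proc.finished = true;
            proc.lockHeld = false;
        }
        return done;
    }

    /** Runs a stuck child under a parent, then bypasses both. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        ToyProcedure parent = new ToyProcedure("parent", null, false);
        ToyProcedure child = new ToyProcedure("child", parent, true);

        runOnce(child, log);          // body runs, cannot finish, lock stays held
        bypassWithAncestors(child);
        runOnce(child, log);          // body skipped, child finishes
        runOnce(parent, log);         // parent finishes without running its body
        log.add("child finished=" + child.finished + " lockHeld=" + child.lockHeld);
        log.add("parent finished=" + parent.finished + " lockHeld=" + parent.lockHeld);
        return log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this sketch the stuck child holds its lock until the flag is set; after 
that, both child and parent finish without running their bodies and the locks 
are released. Whatever work the skipped bodies would have done is simply not 
done, so any resulting inconsistency still has to be repaired separately.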

Note that bypassing a procedure may leave the cluster in an intermediate 
state, e.g. a region left unassigned, or some HDFS files left behind. 
Operators need to know the side effects of bypassing and recover the 
inconsistent state of the cluster themselves, for example by issuing new 
procedures to assign the regions.

A patch will be uploaded and a Review Board request will be opened. For now, 
only APIs in ProcedureExecutor are provided. If everything looks fine, I will 
add it to the master service and add a shell command to bypass a procedure. 
Or, maybe we can use dynamically compiled JSPs to execute those APIs as 
mentioned in HBASE-20679.



> Introduce a mechanism to bypass the execution of a stuck procedure
> ------------------------------------------------------------------
>
>                 Key: HBASE-21083
>                 URL: https://issues.apache.org/jira/browse/HBASE-21083
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2
>    Affects Versions: 2.1.0, 2.0.1
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>             Fix For: 3.0.0, 2.1.1, 2.0.2
>
>         Attachments: HBASE-21083.branch-2.0.001.patch, 
> HBASE-21083.branch-2.0.002.patch, HBASE-21083.branch-2.0.003.patch, 
> HBASE-21083.branch-2.0.003.patch, HBASE-21083.branch-2.0.003.patch, 
> HBASE-21083.branch-2.1.001.patch
>



