[ 
https://issues.apache.org/jira/browse/HBASE-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654846#comment-16654846
 ] 

Jingyun Tian commented on HBASE-21291:
--------------------------------------

[~stack] I added the condition check to all procedures, not only state 
machine procedures. 
{quote}
So, if override is set, make the waitTime some nominal amount – say 10ms? This 
way we wait on the lock for a little while but will proceed after 10ms even if 
we don't get the lock?
{quote}
Yes, it will wait 10ms to try to get the lock. If we do not get the lock but 
override is set, the bypass is processed anyway. But the lock is released 
only when the stuck procedure finishes.
{quote}
 finally {
      if (lockEntry != null) {
        procExecutionLock.releaseLockEntry(lockEntry);
      }
    }
{quote}
Thus restarting the master is needed to resolve the problem.
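
The wait-then-override behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: `BypassSketch`, `Result`, `bypass` and `demo` are hypothetical names, `java.util.concurrent.locks.ReentrantLock` stands in for HBase's `IdLock`, and rejecting the bypass when neither the lock nor override is available is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; ReentrantLock stands in for HBase's IdLock.
class BypassSketch {
  enum Result { BYPASSED_WITH_LOCK, BYPASSED_WITHOUT_LOCK, REJECTED }

  static Result bypass(ReentrantLock procLock, long lockWaitMs, boolean override) {
    boolean locked = false;
    try {
      // Wait a bounded time for the lock instead of blocking forever.
      locked = procLock.tryLock(lockWaitMs, TimeUnit.MILLISECONDS);
      if (!locked && !override) {
        return Result.REJECTED; // assumption: without override we give up
      }
      // The procedure would be marked as bypassed here.
      return locked ? Result.BYPASSED_WITH_LOCK : Result.BYPASSED_WITHOUT_LOCK;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Result.REJECTED;
    } finally {
      // Mirrors the null check on lockEntry: release only what we acquired.
      if (locked) {
        procLock.unlock();
      }
    }
  }

  // Runs three bypass attempts: two while another thread holds the lock,
  // one after that thread has released it.
  static List<Result> demo() {
    ReentrantLock lock = new ReentrantLock();
    CountDownLatch held = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    Thread stuck = new Thread(() -> {
      lock.lock(); // simulates the stuck procedure holding its lock
      try {
        held.countDown();
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        lock.unlock();
      }
    });
    stuck.start();
    List<Result> results = new ArrayList<>();
    try {
      held.await();
      results.add(bypass(lock, 10, false));
      results.add(bypass(lock, 10, true));
      release.countDown();
      stuck.join();
      results.add(bypass(lock, 10, false));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return results;
  }

  public static void main(String[] args) {
    // prints [REJECTED, BYPASSED_WITHOUT_LOCK, BYPASSED_WITH_LOCK]
    System.out.println(demo());
  }
}
```

In this sketch the bypass with override proceeds without ever owning the lock, so the holder still keeps it until it finishes on its own.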

> Add a test for bypassing stuck state-machine procedures
> -------------------------------------------------------
>
>                 Key: HBASE-21291
>                 URL: https://issues.apache.org/jira/browse/HBASE-21291
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Jingyun Tian
>            Assignee: Jingyun Tian
>            Priority: Major
>             Fix For: 3.0.0, 2.2.0
>
>         Attachments: HBASE-21291.master.001.patch, 
> HBASE-21291.master.002.patch, HBASE-21291.master.003.patch, 
> HBASE-21291.master.004.patch, HBASE-21291.master.005.patch
>
>
> {code}
>       if (!procedure.isFailed()) {
>         if (subprocs != null) {
>           if (subprocs.length == 1 && subprocs[0] == procedure) {
>             // Procedure returned itself. Quick-shortcut for a state machine-like procedure;
>             // i.e. we go around this loop again rather than go back out on the scheduler queue.
>             subprocs = null;
>             reExecute = true;
>             LOG.trace("Short-circuit to next step on pid={}", procedure.getProcId());
>           } else {
>             // Yield the current procedure, and make the subprocedure runnable
>             // subprocs may come back 'null'.
>             subprocs = initializeChildren(procStack, procedure, subprocs);
>             LOG.info("Initialized subprocedures=" +
>               (subprocs == null? null:
>                 Stream.of(subprocs).map(e -> "{" + e.toString() + "}").
>                 collect(Collectors.toList()).toString()));
>           }
>         } else if (procedure.getState() == ProcedureState.WAITING_TIMEOUT) {
>           LOG.debug("Added to timeoutExecutor {}", procedure);
>           timeoutExecutor.add(procedure);
>         } else if (!suspended) {
>           // No subtask, so we are done
>           procedure.setState(ProcedureState.SUCCESS);
>         }
>       }
> {code}
> The current implementation of ProcedureExecutor sets reExecute to true 
> for a state-machine-like procedure that returns itself. If such a procedure 
> is stuck in one state, it will loop forever.
> {code}
>           IdLock.Entry lockEntry = procExecutionLock.getLockEntry(proc.getProcId());
>           try {
>             executeProcedure(proc);
>           } catch (AssertionError e) {
>             LOG.info("ASSERT pid=" + proc.getProcId(), e);
>             throw e;
>           } finally {
>             procExecutionLock.releaseLockEntry(lockEntry);
> {code}
> Since a procedure takes the IdLock before execution and releases it only 
> after execution is done, a state machine procedure never releases the IdLock 
> until it finishes.
> bypassProcedure then does not work, because it first tries to grab the same 
> IdLock.
> {code}
>     IdLock.Entry lockEntry = procExecutionLock.tryLockEntry(procedure.getProcId(), lockWait);
> {code}
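
The lock starvation described in the issue can be reproduced in miniature. The following is a rough sketch, assuming a plain `ReentrantLock` in place of `IdLock` and a busy loop in place of the re-executed state-machine step; `StuckLoopSketch`, `bypassCanLock` and `unstick` are hypothetical names that do not exist in HBase.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; ReentrantLock stands in for HBase's IdLock.
class StuckLoopSketch {
  // Returns whether a bypass-like caller could take the lock within lockWaitMs
  // while a worker keeps re-executing a stuck step under that same lock.
  static boolean bypassCanLock(long lockWaitMs) {
    ReentrantLock execLock = new ReentrantLock();
    AtomicBoolean unstick = new AtomicBoolean(false);
    CountDownLatch lockHeld = new CountDownLatch(1);

    Thread worker = new Thread(() -> {
      execLock.lock(); // taken once, before the first execution
      try {
        lockHeld.countDown();
        boolean reExecute;
        do {
          // The stuck step "returns itself", so we loop without unlocking.
          reExecute = !unstick.get();
          Thread.yield();
        } while (reExecute);
      } finally {
        execLock.unlock(); // only reached once the loop ends
      }
    });
    worker.start();

    boolean acquired = false;
    try {
      lockHeld.await();
      acquired = execLock.tryLock(lockWaitMs, TimeUnit.MILLISECONDS);
      if (acquired) {
        execLock.unlock();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      unstick.set(true); // let the worker finish so the sketch terminates
    }
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return acquired;
  }

  public static void main(String[] args) {
    System.out.println(bypassCanLock(10)); // prints false
  }
}
```

The timed lock attempt fails because the worker never leaves its loop, which is the situation the bypass has to cope with.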


