[ 
https://issues.apache.org/jira/browse/HBASE-20949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559101#comment-16559101
 ] 

Duo Zhang commented on HBASE-20949:
-----------------------------------

OK, I think the problem here is similar to HBASE-20939.

After we dispatch the open region request, we suspend ourselves and return. But 
the open region call finishes immediately and wakes us up, so another PEWorker 
takes charge of the procedure and sets the procedure state to SUCCESS. When the 
original PEWorker comes back, it also finds that the procedure is in the 
SUCCESS state, so it also tries to finish the procedure, which causes a double 
release.

{code}
      LockState lockState = acquireLock(proc);
      switch (lockState) {
        case LOCK_ACQUIRED:
          execProcedure(procStack, proc);
          break;
        case LOCK_YIELD_WAIT:
          LOG.info(lockState + " " + proc);
          scheduler.yield(proc);
          break;
        case LOCK_EVENT_WAIT:
          // Someone will wake us up when the lock is available
          LOG.debug(lockState + " " + proc);
          break;
        default:
          throw new UnsupportedOperationException();
      }
      procStack.release(proc);

      if (proc.isSuccess()) {
        // update metrics on finishing the procedure
        proc.updateMetricsOnFinish(getEnvironment(), proc.elapsedTime(), true);
        LOG.info("Finished " + proc + " in " + StringUtils.humanTimeDiff(proc.elapsedTime()));
        // Finalize the procedure state
        if (proc.getProcId() == rootProcId) {
          procedureFinished(proc);
        } else {
          execCompletionCleanup(proc);
        }
        break;
      }
{code}

This is the critical part: the 'if (proc.isSuccess())' block is executed twice, 
once by each PEWorker, so the procedure is finished twice and we are dead.
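
To make the interleaving above concrete, here is a minimal sketch that replays 
it sequentially. Everything in it (DoubleFinishSketch, SketchProcedure, 
finishUnguarded, finishOnce, replay) is a hypothetical stand-in, not HBase code, 
and the compare-and-set guard is only one possible way to make the finish step 
run once; it is not necessarily what the HBASE-20939 patch will do.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class DoubleFinishSketch {
  /** Hypothetical stand-in for a procedure; NOT the HBase Procedure class. */
  static final class SketchProcedure {
    private volatile boolean success;
    private final AtomicBoolean finished = new AtomicBoolean(false);
    final AtomicInteger releases = new AtomicInteger();

    void markSuccess() { success = true; }
    boolean isSuccess() { return success; }

    // Mirrors the unguarded check: every worker that sees SUCCESS releases.
    void finishUnguarded() {
      if (isSuccess()) {
        releases.incrementAndGet();
      }
    }

    // Assumed guard: only the worker that wins the compare-and-set releases.
    boolean finishOnce() {
      if (isSuccess() && finished.compareAndSet(false, true)) {
        releases.incrementAndGet();
        return true;
      }
      return false;
    }
  }

  /** Replays the described interleaving and returns the number of releases. */
  static int replay(boolean guarded) {
    SketchProcedure proc = new SketchProcedure();
    // Worker A dispatches the request and suspends; worker B is woken up
    // and moves the procedure to SUCCESS.
    proc.markSuccess();
    // Worker B sees SUCCESS and finishes the procedure.
    if (guarded) { proc.finishOnce(); } else { proc.finishUnguarded(); }
    // Worker A comes back, also sees SUCCESS, and tries to finish again.
    if (guarded) { proc.finishOnce(); } else { proc.finishUnguarded(); }
    return proc.releases.get();
  }

  public static void main(String[] args) {
    System.out.println("unguarded releases: " + replay(false));
    System.out.println("guarded releases: " + replay(true));
  }
}
```

In the unguarded replay both workers pass the success check, which is the 
double release; with the guard the second worker's attempt is a no-op.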

Let me prepare a patch in HBASE-20939 to see if it helps.

> Split/Merge table can be executed concurrently with DisableTableProcedure
> -------------------------------------------------------------------------
>
>                 Key: HBASE-20949
>                 URL: https://issues.apache.org/jira/browse/HBASE-20949
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Duo Zhang
>            Priority: Major
>         Attachments: HBASE-20949-debug.patch
>
>
> The top flaky tests on the dashboard are all because of this.
> TestRestoreSnapshotFromClient
> TestSimpleRegionNormalizerOnCluster
> Theoretically this should not happen, need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
