[
https://issues.apache.org/jira/browse/HBASE-20949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559101#comment-16559101
]
Duo Zhang commented on HBASE-20949:
-----------------------------------
OK, I think the problem here is similar to HBASE-20939.
After we dispatch the open region request, we suspend ourselves and return. But
the open region call finishes immediately and wakes us up, so another PEWorker
takes charge of the procedure and sets the procedure state to SUCCESS. Then the
original PEWorker comes back, also finds the procedure in the SUCCESS state,
and also tries to finish the procedure, which causes a double release.
{code}
LockState lockState = acquireLock(proc);
switch (lockState) {
  case LOCK_ACQUIRED:
    execProcedure(procStack, proc);
    break;
  case LOCK_YIELD_WAIT:
    LOG.info(lockState + " " + proc);
    scheduler.yield(proc);
    break;
  case LOCK_EVENT_WAIT:
    // Someone will wake us up when the lock is available
    LOG.debug(lockState + " " + proc);
    break;
  default:
    throw new UnsupportedOperationException();
}
procStack.release(proc);
if (proc.isSuccess()) {
  // update metrics on finishing the procedure
  proc.updateMetricsOnFinish(getEnvironment(), proc.elapsedTime(), true);
  LOG.info("Finished " + proc + " in " +
      StringUtils.humanTimeDiff(proc.elapsedTime()));
  // Finalize the procedure state
  if (proc.getProcId() == rootProcId) {
    procedureFinished(proc);
  } else {
    execCompletionCleanup(proc);
  }
  break;
}
{code}
This is the critical part: the 'if (proc.isSuccess())' block is executed
twice, once by each PEWorker, so we are dead.
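To make the interleaving concrete, here is a minimal sketch that replays it sequentially. It is not the HBase code and not the HBASE-20939 patch: SketchProcedure, finishUnguarded and finishGuarded are hypothetical names, and the compare-and-set guard is only one possible way to make the finish step run once, shown here as an assumption.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a procedure; NOT the HBase Procedure class.
class SketchProcedure {
  private volatile boolean success;
  private final AtomicBoolean finished = new AtomicBoolean(false);
  private final AtomicInteger releases = new AtomicInteger();

  void markSuccess() { success = true; }
  boolean isSuccess() { return success; }
  int releaseCount() { return releases.get(); }

  // Unguarded: every worker that observes SUCCESS runs the finish path,
  // mirroring the 'if (proc.isSuccess())' check being reached twice.
  void finishUnguarded() {
    if (isSuccess()) {
      releases.incrementAndGet();
    }
  }

  // Guarded (assumed fix, for illustration only): only the first worker
  // wins the compare-and-set, so the finish path runs at most once.
  boolean finishGuarded() {
    if (isSuccess() && finished.compareAndSet(false, true)) {
      releases.incrementAndGet();
      return true;
    }
    return false;
  }
}

class DoubleFinishSketch {
  public static void main(String[] args) {
    // Replay: the second worker sets SUCCESS and finishes, then the
    // original worker comes back and also sees SUCCESS.
    SketchProcedure unguarded = new SketchProcedure();
    unguarded.markSuccess();
    unguarded.finishUnguarded(); // second PEWorker
    unguarded.finishUnguarded(); // original PEWorker, back from suspend
    System.out.println("unguarded releases: " + unguarded.releaseCount());

    SketchProcedure guarded = new SketchProcedure();
    guarded.markSuccess();
    guarded.finishGuarded(); // second PEWorker wins
    guarded.finishGuarded(); // original PEWorker is turned away
    System.out.println("guarded releases: " + guarded.releaseCount());
  }
}
```

Run as written, the unguarded path reports 2 releases and the guarded path reports 1.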
Let me prepare a patch in HBASE-20939 to see if it helps.
> Split/Merge table can be executed concurrently with DisableTableProcedure
> -------------------------------------------------------------------------
>
> Key: HBASE-20949
> URL: https://issues.apache.org/jira/browse/HBASE-20949
> Project: HBase
> Issue Type: Sub-task
> Reporter: Duo Zhang
> Priority: Major
> Attachments: HBASE-20949-debug.patch
>
>
> The top flaky tests on the dashboard are all because of this.
> TestRestoreSnapshotFromClient
> TestSimpleRegionNormalizerOnCluster
> Theoretically this should not happen, need to dig more.