Jerry He created HBASE-11488:
--------------------------------
Summary: cancelTasks in SubprocedurePool can hang during task error
Key: HBASE-11488
URL: https://issues.apache.org/jira/browse/HBASE-11488
Project: HBase
Issue Type: Bug
Components: snapshots
Affects Versions: 0.98.3, 0.96.1, 0.99.0
Reporter: Jerry He
Assignee: Jerry He
Priority: Minor
During a snapshot on the region server side, if one RegionSnapshotTask throws
an exception, we cancel the other tasks.
In
RegionServerSnapshotManager.SnapshotSubprocedurePool.waitForOutstandingTasks():
{code}
LOG.debug("Waiting for local region snapshots to finish.");
int sz = futures.size();
try {
  // Using the completion service to process the futures that finish first first.
  for (int i = 0; i < sz; i++) {
    Future<Void> f = taskPool.take();
    f.get();
    if (!futures.remove(f)) {
      LOG.warn("unexpected future" + f);
    }
    LOG.debug("Completed " + (i+1) + "/" + sz + " local region snapshots.");
  }
  LOG.debug("Completed " + sz + " local region snapshots.");
  return true;
} catch (InterruptedException e) {
  LOG.warn("Got InterruptedException in SnapshotSubprocedurePool", e);
  if (!stopped) {
    Thread.currentThread().interrupt();
    throw new ForeignException("SnapshotSubprocedurePool", e);
  }
  // we are stopped so we can just exit.
} catch (ExecutionException e) {
  if (e.getCause() instanceof ForeignException) {
    LOG.warn("Rethrowing ForeignException from SnapshotSubprocedurePool", e);
    throw (ForeignException)e.getCause();
  }
  LOG.warn("Got Exception in SnapshotSubprocedurePool", e);
  throw new ForeignException(name, e.getCause());
} finally {
  cancelTasks();
}
{code}
If f.get() throws ExecutionException (for example, caused by
NotServingRegionException), we will call cancelTasks().
In cancelTasks():
{code}
...
// evict remaining tasks and futures from taskPool.
while (!futures.isEmpty()) {
  // block to remove cancelled futures;
  LOG.warn("Removing cancelled elements from taskPool");
  futures.remove(taskPool.take());
}
{code}
For example, suppose we have 3 tasks and the first one fails, so we get an
exception when we do:
{code}
Future<Void> f = taskPool.take();
f.get();
{code}
At that point we have not yet removed 'f' from the 'futures' list, but we have
already taken it from taskPool.
As a result, there are 3 entries in the 'futures' list, but only 2 remain in taskPool.
Once those 2 are drained, cancelTasks() blocks forever on taskPool.take().
The end result is that the procedure always fails with a timeout exception in
the end.
We could have bailed out earlier with the real cause.
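The mismatch can be reproduced outside HBase. The following is a minimal sketch, not the HBase code: the class CancelTasksSketch, the method leftoverFutures() and the removeBeforeGet flag are made up for illustration. It uses a plain ExecutorCompletionService, replaces the blocking take() in the cleanup loop with a timed poll() so the demo terminates, and leaves out the cancellation itself since only the bookkeeping matters here. Removing the future from the list before calling get() is one possible way to keep the list and the completion queue in step; it is an assumption, not necessarily the fix that will be committed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CancelTasksSketch {

  /**
   * Runs 3 tasks where the first one fails, then drains the completion queue.
   * Returns the number of futures still in the list when the queue is empty.
   * A non-zero result is the situation in which a blocking take() would hang.
   */
  static int leftoverFutures(boolean removeBeforeGet) throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(3);
    ExecutorCompletionService<Void> taskPool = new ExecutorCompletionService<>(executor);
    List<Future<Void>> futures = new ArrayList<>();
    CountDownLatch release = new CountDownLatch(1);

    // Task 1 fails immediately; tasks 2 and 3 wait until they are released,
    // so the failed task is always the first one handed out by take().
    futures.add(taskPool.submit(() -> { throw new IllegalStateException("simulated failure"); }));
    for (int i = 0; i < 2; i++) {
      futures.add(taskPool.submit(() -> { release.await(); return null; }));
    }

    try {
      Future<Void> f = taskPool.take();   // consumes one entry from the completion queue
      if (removeBeforeGet) {
        futures.remove(f);                // hypothetical fix: update the list first
      }
      f.get();                            // throws ExecutionException for the failed task
      futures.remove(f);                  // not reached when get() throws
    } catch (ExecutionException e) {
      // fall through to the cleanup below, as the finally block in the original does
    } finally {
      release.countDown();                // let the two remaining tasks finish
    }

    // Stand-in for the loop in cancelTasks(): poll() with a timeout instead of take().
    while (!futures.isEmpty()) {
      Future<Void> done = taskPool.poll(500, TimeUnit.MILLISECONDS);
      if (done == null) {
        break;                            // queue is empty but the list is not
      }
      futures.remove(done);
    }
    executor.shutdownNow();
    return futures.size();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("remove after get():  " + leftoverFutures(false) + " future(s) left over");
    System.out.println("remove before get(): " + leftoverFutures(true) + " future(s) left over");
  }
}
```

With the removal after get(), one future is left in the list and nothing remains in the queue to match it. With the removal before get(), the list empties and the loop exits on its own.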