[
https://issues.apache.org/jira/browse/HBASE-11488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059041#comment-14059041
]
Matteo Bertozzi commented on HBASE-11488:
-----------------------------------------
+1
> cancelTasks in SubprocedurePool can hang during task error
> ----------------------------------------------------------
>
> Key: HBASE-11488
> URL: https://issues.apache.org/jira/browse/HBASE-11488
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Affects Versions: 0.96.1, 0.99.0, 0.98.3
> Reporter: Jerry He
> Assignee: Jerry He
> Priority: Minor
> Attachments: HBASE-11488-master.patch
>
>
> During a snapshot on the region server side, if one RegionSnapshotTask throws
> an exception, we cancel the other tasks.
> In
> RegionServerSnapshotManager.SnapshotSubprocedurePool.waitForOutstandingTasks():
> {code}
> LOG.debug("Waiting for local region snapshots to finish.");
> int sz = futures.size();
> try {
>   // Using the completion service to process the futures that finish first first.
>   for (int i = 0; i < sz; i++) {
>     Future<Void> f = taskPool.take();
>     f.get();
>     if (!futures.remove(f)) {
>       LOG.warn("unexpected future" + f);
>     }
>     LOG.debug("Completed " + (i+1) + "/" + sz + " local region snapshots.");
>   }
>   LOG.debug("Completed " + sz + " local region snapshots.");
>   return true;
> } catch (InterruptedException e) {
>   LOG.warn("Got InterruptedException in SnapshotSubprocedurePool", e);
>   if (!stopped) {
>     Thread.currentThread().interrupt();
>     throw new ForeignException("SnapshotSubprocedurePool", e);
>   }
>   // we are stopped so we can just exit.
> } catch (ExecutionException e) {
>   if (e.getCause() instanceof ForeignException) {
>     LOG.warn("Rethrowing ForeignException from SnapshotSubprocedurePool", e);
>     throw (ForeignException)e.getCause();
>   }
>   LOG.warn("Got Exception in SnapshotSubprocedurePool", e);
>   throw new ForeignException(name, e.getCause());
> } finally {
>   cancelTasks();
> }
> {code}
> If f.get() throws ExecutionException (for example, caused by
> NotServingRegionException), we will call cancelTasks().
> In cancelTasks():
> {code}
> ...
> // evict remaining tasks and futures from taskPool.
> while (!futures.isEmpty()) {
>   // block to remove cancelled futures;
>   LOG.warn("Removing cancelled elements from taskPool");
>   futures.remove(taskPool.take());
> }
> {code}
> For example, suppose we have 3 tasks. The first one fails, and we get an
> exception when we do:
> {code}
> Future<Void> f = taskPool.take();
> f.get();
> {code}
> At that point we have not yet removed 'f' from the 'futures' list, but we have
> already taken one entry from taskPool.
> As a result, there are 3 entries in the 'futures' list, but only 2 remain in taskPool.
> We will block on taskPool.take() in the cancelTasks() code above.
> The net result is that the procedure always fails with a timeout exception,
> when it could have bailed out earlier with the real cause.
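
The count mismatch described above can be reproduced outside HBase with a plain java.util.concurrent.ExecutorCompletionService. The following is a minimal standalone sketch, not HBase code: the class name CompletionDrainSketch, the run() method and the removeBeforeGet flag are all hypothetical names made up for illustration. It uses a bounded poll() instead of take() so the sketch itself cannot hang. Removing 'f' from 'futures' before calling f.get() is shown only as one possible way to keep the two counts in step; it is an assumption, not necessarily what the attached patch does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical standalone class; nothing here is part of HBase.
final class CompletionDrainSketch {

    /**
     * Submits taskCount tasks, the first of which fails, and processes only
     * the first completion. Returns a two-element array:
     * [0] = futures still tracked in the list after the failure,
     * [1] = completions that could still be taken from the pool.
     */
    static int[] run(int taskCount, boolean removeBeforeGet) {
        ExecutorService executor = Executors.newFixedThreadPool(taskCount);
        ExecutorCompletionService<Void> taskPool = new ExecutorCompletionService<>(executor);
        List<Future<Void>> futures = new ArrayList<>();
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Task 0 fails at once; the others wait on the latch, so the
            // failed task is guaranteed to be the first completion taken.
            futures.add(taskPool.submit(() -> {
                throw new IllegalStateException("simulated task failure");
            }));
            for (int i = 1; i < taskCount; i++) {
                futures.add(taskPool.submit(() -> {
                    release.await();
                    return null;
                }));
            }

            Future<Void> f = taskPool.take(); // consumes one completion
            if (removeBeforeGet) {
                futures.remove(f); // assumed remedy: untrack before get()
            }
            try {
                f.get();
                if (!removeBeforeGet) {
                    futures.remove(f); // original order: skipped when get() throws
                }
            } catch (ExecutionException e) {
                // Same situation as in the issue: the first task failed.
            }

            release.countDown(); // let the remaining tasks finish

            // Count the completions that are still available. A bounded
            // poll() is used here so that a mismatch cannot block forever.
            int left = 0;
            while (taskPool.poll(1, TimeUnit.SECONDS) != null) {
                left++;
            }
            return new int[] {futures.size(), left};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int[] original = run(3, false);
        int[] reordered = run(3, true);
        System.out.println("original order:    tracked=" + original[0] + " left=" + original[1]);
        System.out.println("remove before get: tracked=" + reordered[0] + " left=" + reordered[1]);
    }
}
```

With 3 tasks, the original ordering leaves 3 tracked futures but only 2 obtainable completions, so a loop of the form while (!futures.isEmpty()) futures.remove(taskPool.take()) would block on its third take(). With the reordering, both counts are 2 and such a loop would terminate.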