Jerry He created HBASE-11488:
--------------------------------

             Summary: cancelTasks in SubprocedurePool can hang during task error
                 Key: HBASE-11488
                 URL: https://issues.apache.org/jira/browse/HBASE-11488
             Project: HBase
          Issue Type: Bug
          Components: snapshots
    Affects Versions: 0.98.3, 0.96.1, 0.99.0
            Reporter: Jerry He
            Assignee: Jerry He
            Priority: Minor


During a snapshot on the region server side, if one RegionSnapshotTask throws an 
exception, we cancel the other tasks.
In 
RegionServerSnapshotManager.SnapshotSubprocedurePool.waitForOutstandingTasks():
{code}
      LOG.debug("Waiting for local region snapshots to finish.");

      int sz = futures.size();
      try {
        // Using the completion service to process the futures that finish first first.
        for (int i = 0; i < sz; i++) {
          Future<Void> f = taskPool.take();
          f.get();
          if (!futures.remove(f)) {
            LOG.warn("unexpected future" + f);
          }
          LOG.debug("Completed " + (i+1) + "/" + sz +  " local region snapshots.");
        }
        LOG.debug("Completed " + sz +  " local region snapshots.");
        return true;
      } catch (InterruptedException e) {
        LOG.warn("Got InterruptedException in SnapshotSubprocedurePool", e);
        if (!stopped) {
          Thread.currentThread().interrupt();
          throw new ForeignException("SnapshotSubprocedurePool", e);
        }
        // we are stopped so we can just exit.
      } catch (ExecutionException e) {
        if (e.getCause() instanceof ForeignException) {
          LOG.warn("Rethrowing ForeignException from SnapshotSubprocedurePool", e);
          throw (ForeignException)e.getCause();
        }
        LOG.warn("Got Exception in SnapshotSubprocedurePool", e);
        throw new ForeignException(name, e.getCause());
      } finally {
        cancelTasks();
      }
{code}
If f.get() throws an ExecutionException (for example, one caused by a 
NotServingRegionException), the finally block calls cancelTasks().
In cancelTasks():
{code}
     ...
     // evict remaining tasks and futures from taskPool.
     while (!futures.isEmpty()) {
        // block to remove cancelled futures;
        LOG.warn("Removing cancelled elements from taskPool");
        futures.remove(taskPool.take());
      }
{code}

For example, suppose we have 3 tasks and the first one fails, so we get an 
exception when we do:
{code}
          Future<Void> f = taskPool.take();
          f.get();
{code}
At that point we have not yet removed 'f' from the 'futures' list, but we have 
already taken it from taskPool.
As a result, 'futures' still holds 3 entries, while at most 2 completed futures 
are left for taskPool to return.
After those 2 are drained, the loop in cancelTasks() above blocks forever on 
taskPool.take().
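
The mismatch can be reproduced outside HBase with a plain ExecutorCompletionService. The following is only a minimal sketch: the class name, the latch, and the simulated failure are made up for illustration, and a timed poll() stands in for take() so that the demo terminates instead of hanging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone demo; not HBase code.
class TakeMismatchDemo {

  /** Returns how many futures are still tracked once the completion queue runs dry. */
  static int run() throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(3);
    ExecutorCompletionService<Void> taskPool = new ExecutorCompletionService<>(executor);
    List<Future<Void>> futures = new ArrayList<>();
    CountDownLatch release = new CountDownLatch(1);
    try {
      // Task 0 fails right away; tasks 1 and 2 wait, so the failure completes first.
      futures.add(taskPool.submit(() -> {
        throw new IllegalStateException("simulated region failure");
      }));
      for (int i = 0; i < 2; i++) {
        futures.add(taskPool.submit(() -> {
          release.await();
          return null;
        }));
      }

      // Same order as waitForOutstandingTasks(): take, get, then remove.
      Future<Void> f = taskPool.take();
      try {
        f.get();
        futures.remove(f);
      } catch (ExecutionException e) {
        // The removal above was skipped: f is gone from taskPool but still in 'futures'.
      }

      // Let the other two tasks finish (this stands in for cancelling them).
      release.countDown();

      // Same loop as cancelTasks(), but with a timed poll so the demo can exit.
      while (!futures.isEmpty()) {
        Future<Void> done = taskPool.poll(500, TimeUnit.MILLISECONDS);
        if (done == null) {
          break; // this is the point where take() would block forever
        }
        futures.remove(done);
      }
      return futures.size();
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("futures still tracked: " + run());
  }
}
```

In this sketch one future (the failed one) is still in 'futures' when the completion queue is empty, which is exactly the state in which an untimed take() never returns.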

The end result is that the procedure always fails with a timeout exception 
instead. 
We could have bailed out earlier with the real cause.
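
One possible way to keep the two collections in step is to remove 'f' from 'futures' as soon as it is taken, before calling f.get(). The sketch below illustrates that idea under the same assumptions as the description above (3 tasks, the first fails). It is not necessarily the change that will be committed, and the class name, latch, and simulated failure are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-alone sketch; not HBase code.
class RemoveBeforeGetSketch {

  /** Returns the real failure message plus the number of futures left after cleanup. */
  static String run() throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(3);
    ExecutorCompletionService<Void> taskPool = new ExecutorCompletionService<>(executor);
    List<Future<Void>> futures = new ArrayList<>();
    CountDownLatch release = new CountDownLatch(1);
    String cause = null;
    try {
      // Task 0 fails right away; tasks 1 and 2 wait, so the failure completes first.
      futures.add(taskPool.submit(() -> {
        throw new IllegalStateException("simulated region failure");
      }));
      for (int i = 0; i < 2; i++) {
        futures.add(taskPool.submit(() -> {
          release.await();
          return null;
        }));
      }

      try {
        for (int i = 0, sz = futures.size(); i < sz; i++) {
          Future<Void> f = taskPool.take();
          futures.remove(f); // bookkeeping first, so a failing get() cannot skip it
          f.get();
        }
      } catch (ExecutionException e) {
        cause = e.getCause().getMessage(); // the real cause is available immediately
      } finally {
        // Stand-in for cancelTasks(): let the remaining tasks finish, then drain them.
        release.countDown();
        while (!futures.isEmpty()) {
          futures.remove(taskPool.take()); // exactly one queued future per tracked one
        }
      }
    } finally {
      executor.shutdownNow();
    }
    return cause + ", remaining=" + futures.size();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(run());
  }
}
```

With the removal moved ahead of get(), the drain loop sees 2 tracked futures and 2 queued ones, so it finishes and the original exception can be reported instead of a timeout.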


