[ 
https://issues.apache.org/jira/browse/HBASE-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540414#comment-13540414
 ] 

terry zhang commented on HBASE-7451:
------------------------------------

This is because GloballyConsistentRegionLockTask is extends from TwoPhaseCommit 
(not ThreePhaseCommit). So it do not have OperationAttemptTimer to handle 
timeout exception. So when GlobalSnapshotOperation meet timeout,below code will 
not be executed. and AllowCommit will not released.

{code:title=GlobalSnapshotOperation.java|borderStyle=solid}

  @Override
  public void commit() throws DistributedCommitException {
    // Release all the locks taken on the involved regions
    if (ops == null || ops.size() == 0) {
      LOG.debug("No region operations to release from the snapshot because we 
didn't get a chance"
          + " to create them.");
      return;
    }
    LOG.debug("Releasing commit barrier for globally consistent snapshot.");
    for (RegionSnapshotOperation op : ops) {
      ((GloballyConsistentRegionLockTask) op).getAllowCommitLatch().countDown();
    }

    // wait for all the outstanding tasks
    waitUntilDone();
  }
{code} 

So GloballyConsistentRegionLockTask will wait for ever.

 
                
> [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout 
> happen
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-7451
>                 URL: https://issues.apache.org/jira/browse/HBASE-7451
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>            Reporter: terry zhang
>            Assignee: terry zhang
>
> Hi Matteo Bertozzi and Jesse Yates, My observation is base on code in github 
> : https://github.com/matteobertozzi/hbase/
> If we create a snapshot and meet regionserver timeout. Rs will be lock and 
> can not put any data. Please take a look at log below :
> // regionserver snapshot timeout
> org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException: 
> org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
>  Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
> at 
> org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135)
> at 
> org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71)
> at 
> org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92)
> at 
> org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89)
> at 
> org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> Caused by: 
> org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
>  Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
> ... 3 more
> 2012-12-26 18:44:57,211 DEBUG 
> org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
> 2012-12-26 18:44:57,211 DEBUG 
> org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
> Cleanup snapshot - handled in sub-tasks on error
> 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv
> //Waiting for 'commit allowed' latch and do not exist
> 2012-12-26 18:44:57,211 DEBUG 
> org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
> 2012-12-26 18:44:57,211 DEBUG 
> org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
> Cleanup snapshot - handled in sub-tasks on error
> 2012-12-26 18:44:57,212 DEBUG 
> org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase.
> 2012-12-26 18:44:57,212 DEBUG 
> org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
> Finish snapshot - handling in subtasks on error
> 2012-12-26 18:44:57,212 WARN 
> org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer 
> already marked completed, ignoring!
> 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:17,002 INFO 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Received children changed event:/hbase-TERRY-73/online-snapshot/prepare
> 2012-12-26 18:45:17,002 INFO 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Recieved start event.
> 2012-12-26 18:45:17,002 DEBUG 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Looking for new operations under 
> znode:/hbase-TERRY-73/online-snapshot/prepare
> 2012-12-26 18:45:17,003 INFO 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Received children changed event:/hbase-TERRY-73/online-snapshot/abort
> 2012-12-26 18:45:17,003 INFO 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Recieved abort event.
> 2012-12-26 18:45:17,003 DEBUG 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Checking for aborted operations on node:/hbase-TERRY-73/online-snapshot/abort
> 2012-12-26 18:45:21,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:26,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:31,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:36,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:41,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:43,481 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: 
> LRU Stats: total=11.77 MB, free=1.39 GB, max=1.4 GB, blocks=5, accesses=96, 
> hits=91, hitRatio=94.79%, cachingAccesses=96, cachingHits=91, 
> cachingHitsRatio=94.79%, evictions=0, evicted=0, evictedPerRun=NaN
> 2012-12-26 18:45:46,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:51,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:56,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:01,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:06,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:11,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:16,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:21,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:26,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:31,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:36,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:41,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to