[
https://issues.apache.org/jira/browse/HBASE-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540418#comment-13540418
]
terry zhang commented on HBASE-7451:
------------------------------------
My solution is simple, Just notify snapshotErrorListener to threw a exception
to let all GloballyConsistentRegionLockTask to quit and release the region
lock.
{code:title=SnapshotOperation.java|borderStyle=solid}
@Override
public void cleanup(Exception e) {
LOG.debug("Cleanup snapshot - handled in sub-tasks on error");
+ snapshotErrorListener.receiveError("Cleanup snapshot", new
HBaseSnapshotException(e.getMessage()));
}
{code}
> [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout
> happen
> ------------------------------------------------------------------------------------
>
> Key: HBASE-7451
> URL: https://issues.apache.org/jira/browse/HBASE-7451
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Reporter: terry zhang
> Assignee: terry zhang
>
> Hi Matteo Bertozzi and Jesse Yates, My observation is base on code in github
> : https://github.com/matteobertozzi/hbase/
> If we create a snapshot and meet regionserver timeout. Rs will be lock and
> can not put any data. Please take a look at log below :
> // regionserver snapshot timeout
> org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException:
> org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
> Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
> at
> org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135)
> at
> org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71)
> at
> org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92)
> at
> org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89)
> at
> org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> Caused by:
> org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
> Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
> ... 3 more
> 2012-12-26 18:44:57,211 DEBUG
> org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
> 2012-12-26 18:44:57,211 DEBUG
> org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation:
> Cleanup snapshot - handled in sub-tasks on error
> 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv
> //Waiting for 'commit allowed' latch and do not exist
> 2012-12-26 18:44:57,211 DEBUG
> org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
> 2012-12-26 18:44:57,211 DEBUG
> org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation:
> Cleanup snapshot - handled in sub-tasks on error
> 2012-12-26 18:44:57,212 DEBUG
> org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase.
> 2012-12-26 18:44:57,212 DEBUG
> org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation:
> Finish snapshot - handling in subtasks on error
> 2012-12-26 18:44:57,212 WARN
> org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer
> already marked completed, ignoring!
> 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:17,002 INFO
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
> Received children changed event:/hbase-TERRY-73/online-snapshot/prepare
> 2012-12-26 18:45:17,002 INFO
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
> Recieved start event.
> 2012-12-26 18:45:17,002 DEBUG
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
> Looking for new operations under
> znode:/hbase-TERRY-73/online-snapshot/prepare
> 2012-12-26 18:45:17,003 INFO
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
> Received children changed event:/hbase-TERRY-73/online-snapshot/abort
> 2012-12-26 18:45:17,003 INFO
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
> Recieved abort event.
> 2012-12-26 18:45:17,003 DEBUG
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
> Checking for aborted operations on node:/hbase-TERRY-73/online-snapshot/abort
> 2012-12-26 18:45:21,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:26,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:31,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:36,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:41,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:43,481 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> LRU Stats: total=11.77 MB, free=1.39 GB, max=1.4 GB, blocks=5, accesses=96,
> hits=91, hitRatio=94.79%, cachingAccesses=96, cachingHits=91,
> cachingHitsRatio=94.79%, evictions=0, evicted=0, evictedPerRun=NaN
> 2012-12-26 18:45:46,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:51,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:56,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:01,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:06,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:11,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:16,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:21,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:26,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:31,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:36,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:41,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting
> for 'commit allowed' latch. (sleep:5000 ms)
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira