[
https://issues.apache.org/jira/browse/HBASE-7778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572820#comment-13572820
]
Jonathan Hsieh commented on HBASE-7778:
---------------------------------------
HBASE-7475 adds a Thread.currentThread().interrupt() call to
JVMClusterUtil#shutdown, which may be the culprit. It is not clear to me why it
was added there, but I don't think just removing it is the correct solution
either. My guess is that it prevents the hang that [~nkeywal] encountered in
that issue.
This has to do with attempting to delete the RS's ephemeral ZK node (which
times out and likely forces a timeout/interrupt exit):
{code}
2013-02-06 12:54:56,249 WARN [RegionServer:0;localhost,57007,1360184089426]
zookeeper.RecoverableZooKeeper(226): Possibly transient ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /hbase/rs/localhost,57007,1360184089426
2013-02-06 12:54:56,250 INFO [RegionServer:0;localhost,57007,1360184089426]
util.RetryCounter(54): Sleeping 2000ms before retry #1...
2013-02-06 12:54:58,251 WARN [RegionServer:0;localhost,57007,1360184089426]
zookeeper.RecoverableZooKeeper(226): Possibly transient ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /hbase/rs/localhost,57007,1360184089426
2013-02-06 12:54:58,251 INFO [RegionServer:0;localhost,57007,1360184089426]
util.RetryCounter(54): Sleeping 4000ms before retry #2...
2013-02-06 12:55:02,252 WARN [RegionServer:0;localhost,57007,1360184089426]
zookeeper.RecoverableZooKeeper(226): Possibly transient ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /hbase/rs/localhost,57007,1360184089426
2013-02-06 12:55:02,252 INFO [RegionServer:0;localhost,57007,1360184089426]
util.RetryCounter(54): Sleeping 8000ms before retry #3...
2013-02-06 12:55:10,253 WARN [RegionServer:0;localhost,57007,1360184089426]
zookeeper.RecoverableZooKeeper(226): Possibly transient ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /hbase/rs/localhost,57007,1360184089426
2013-02-06 12:55:10,253 ERROR [RegionServer:0;localhost,57007,1360184089426]
zookeeper.RecoverableZooKeeper(228): ZooKeeper delete failed after 3 retries
2013-02-06 12:55:10,254 WARN [RegionServer:0;localhost,57007,1360184089426]
regionserver.HRegionServer(1012): Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /hbase/rs/localhost,57007,1360184089426
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:141)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1222)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1211)
at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1263)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1010)
at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.runRegionServer(MiniHBaseCluster.java:151)
at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$0(MiniHBaseCluster.java:150)
at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$1.run(MiniHBaseCluster.java:135)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:337)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1118)
...
{code}
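For reference, here is a minimal standalone sketch of why a leftover interrupt flag breaks a later sleep on the same thread, and why clearing it with Thread.interrupted() makes the sleep work again. This is not HBase code: the class InterruptFlagDemo and the method shutdownThatInterruptsCaller() are hypothetical stand-ins for a shutdown helper that calls Thread.currentThread().interrupt() on the caller's thread.

```java
public class InterruptFlagDemo {

    /**
     * Hypothetical stand-in for a shutdown helper that returns with the
     * calling thread's interrupt flag set (assumption, not the HBase method).
     */
    public static void shutdownThatInterruptsCaller() {
        Thread.currentThread().interrupt();
    }

    /** Returns true if the sleep completed, false if it was interrupted. */
    public static boolean sleepSucceeds(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // Thread.sleep clears the interrupt flag when it throws.
            return false;
        }
    }

    public static void main(String[] args) {
        // With the flag set, the next sleep throws immediately.
        shutdownThatInterruptsCaller();
        System.out.println("sleep right after shutdown succeeded: " + sleepSucceeds(10));

        // Thread.interrupted() returns the flag and clears it, so the sleep works.
        shutdownThatInterruptsCaller();
        boolean wasSet = Thread.interrupted();
        System.out.println("interrupt flag was set: " + wasSet);
        System.out.println("sleep after clearing succeeded: " + sleepSucceeds(10));
    }
}
```

Clearing the flag in the caller only masks the symptom; it does not explain which code path set the flag in the first place.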
Digging more...
> [snapshot 130201 merge] Tests with sleep after minicluster shutdown fail due
> to interrupt flag.
> -----------------------------------------------------------------------------------------------
>
> Key: HBASE-7778
> URL: https://issues.apache.org/jira/browse/HBASE-7778
> Project: HBase
> Issue Type: Sub-task
> Reporter: Jonathan Hsieh
>
> Something in the merge has set the interrupted flag on the main test threads
> of TestReplicationDisableInactivePeer, TestRestartCluster, and
> TestCatalogTrackerOnCluster.
> These unacceptable hacks make the tests run and pass:
> {code}
> diff --git
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTrackerOnCluster.java
> b/hbase-server/src/test/java/or
> index f3e57d6..a8d2ef7 100644
> ---
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTrackerOnCluster.java
> +++
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTrackerOnCluster.java
> @@ -47,6 +47,7 @@ public class TestCatalogTrackerOnCluster {
> // Shutdown hbase.
> UTIL.shutdownMiniHBaseCluster();
> // Give the various ZKWatchers some time to settle their affairs.
> + Thread.interrupted(); // HACK clear interrupt state.
> Thread.sleep(1000);
>
> // Mess with the root location in the running zk. Set it to be nonsense.
> diff --git
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestRestartCluster.java
> b/hbase-server/src/test/java/org/apache/h
> index 15225e1..9f7f526 100644
> ---
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestRestartCluster.java
> +++
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestRestartCluster.java
> @@ -108,6 +108,7 @@ public class TestRestartCluster {
> UTIL.shutdownMiniHBaseCluster();
>
> LOG.info("\n\nSleeping a bit");
> + Thread.interrupted(); // HACK clear interrupt state.
> Thread.sleep(2000);
>
> LOG.info("\n\nStarting cluster the second time");
> diff --git
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationDisableInactivePeer.java
> b/hbase-server/src/t
> index b089fbe..8162f4b 100644
> ---
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationDisableInactivePeer.java
> +++
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationDisableInactivePeer.java
> @@ -50,6 +50,7 @@ public class TestReplicationDisableInactivePeer extends
> TestReplicationBase {
> // enabling and shutdown the peer
> admin.enablePeer("2");
> utility2.shutdownMiniHBaseCluster();
> + Thread.interrupted(); // HACK clear interrupted flag.
>
> byte[] rowkey = Bytes.toBytes("disable inactive peer");
> Put put = new Put(rowkey);
> {code}
> On the snapshot branch and on the trunk branch before the merge, these tests
> passed. Need to figure out how the combination caused this behavior change.