[ 
https://issues.apache.org/jira/browse/HBASE-7778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572820#comment-13572820
 ] 

Jonathan Hsieh commented on HBASE-7778:
---------------------------------------

HBASE-7475 adds a Thread.currentThread().interrupt() call to 
JVMClusterUtil#shutdown, which may be the culprit.  It is not clear to me why 
it was added there, but I don't think just removing it is the correct solution 
either.  My guess is that it prevents the hang that [~nkeywal] encountered in 
that issue.
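
For illustration, here is a minimal standalone sketch (not HBase code) of why a leftover interrupt flag breaks a later Thread.sleep() on the same thread. The class and method names are hypothetical; shutdownThatSetsInterruptFlag() only stands in for a shutdown helper that calls Thread.currentThread().interrupt() before returning.

```java
class InterruptFlagDemo {

    /** Hypothetical stand-in for a shutdown helper that leaves the caller's interrupt flag set. */
    public static void shutdownThatSetsInterruptFlag() {
        Thread.currentThread().interrupt();
    }

    /** Returns true if Thread.sleep completed, false if it threw InterruptedException. */
    public static boolean trySleep(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // sleep() throws at once when the flag is already set, and clears the flag as it throws.
            return false;
        }
    }

    public static void main(String[] args) {
        // Case 1: the flag is left set, so the next sleep fails immediately.
        shutdownThatSetsInterruptFlag();
        System.out.println("sleep completed: " + trySleep(10));

        // Case 2: Thread.interrupted() reads and clears the flag, so the sleep succeeds.
        shutdownThatSetsInterruptFlag();
        boolean wasSet = Thread.interrupted();
        System.out.println("flag was set: " + wasSet);
        System.out.println("sleep completed: " + trySleep(10));
    }
}
```

The second case only hides the symptom in the caller; it does not explain why the flag was set in the first place.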

This has to do with the attempt to delete the RS's ephemeral ZK node, which 
times out and likely forces the timeout/interrupt exit:

{code}
2013-02-06 12:54:56,249 WARN  [RegionServer:0;localhost,57007,1360184089426] zookeeper.RecoverableZooKeeper(226): Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/localhost,57007,1360184089426
2013-02-06 12:54:56,250 INFO  [RegionServer:0;localhost,57007,1360184089426] util.RetryCounter(54): Sleeping 2000ms before retry #1...
2013-02-06 12:54:58,251 WARN  [RegionServer:0;localhost,57007,1360184089426] zookeeper.RecoverableZooKeeper(226): Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/localhost,57007,1360184089426
2013-02-06 12:54:58,251 INFO  [RegionServer:0;localhost,57007,1360184089426] util.RetryCounter(54): Sleeping 4000ms before retry #2...
2013-02-06 12:55:02,252 WARN  [RegionServer:0;localhost,57007,1360184089426] zookeeper.RecoverableZooKeeper(226): Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/localhost,57007,1360184089426
2013-02-06 12:55:02,252 INFO  [RegionServer:0;localhost,57007,1360184089426] util.RetryCounter(54): Sleeping 8000ms before retry #3...
2013-02-06 12:55:10,253 WARN  [RegionServer:0;localhost,57007,1360184089426] zookeeper.RecoverableZooKeeper(226): Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/localhost,57007,1360184089426
2013-02-06 12:55:10,253 ERROR [RegionServer:0;localhost,57007,1360184089426] zookeeper.RecoverableZooKeeper(228): ZooKeeper delete failed after 3 retries
2013-02-06 12:55:10,254 WARN  [RegionServer:0;localhost,57007,1360184089426] regionserver.HRegionServer(1012): Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/localhost,57007,1360184089426
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:141)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1222)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1211)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1263)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1010)
        at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.runRegionServer(MiniHBaseCluster.java:151)
        at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$0(MiniHBaseCluster.java:150)
        at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$1.run(MiniHBaseCluster.java:135)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:337)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1118)
...
{code}
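
The retry pattern in the log above (sleeps of 2000ms, 4000ms, and 8000ms, then failure after 3 retries) amounts to about 14 seconds spent inside the delete call, which matches the 12:54:56 to 12:55:10 timestamps. A rough sketch of that kind of doubling backoff is below. It is not the actual RetryCounter or RecoverableZooKeeper code; the class name, method names, and parameters are made up for illustration.

```java
import java.util.concurrent.Callable;

class BackoffRetrySketch {

    /** Sleep before retry number `retry` (1-based): base, 2*base, 4*base, ... */
    public static long backoffMillis(long baseMillis, int retry) {
        return baseMillis << (retry - 1);
    }

    /** Total time spent sleeping if every attempt fails and all retries are used. */
    public static long totalBackoffMillis(long baseMillis, int maxRetries) {
        long total = 0;
        for (int retry = 1; retry <= maxRetries; retry++) {
            total += backoffMillis(baseMillis, retry);
        }
        return total;
    }

    /** Runs op, retrying up to maxRetries times with doubling sleeps between attempts. */
    public static <T> T callWithRetries(Callable<T> op, int maxRetries, long baseMillis)
            throws Exception {
        int retries = 0;
        while (true) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                throw e; // never retry an interrupt; let the caller see it
            } catch (Exception e) {
                if (retries >= maxRetries) {
                    throw e; // retries exhausted: surface the last failure
                }
                retries++;
                // If the thread is interrupted here, sleep() throws and the loop exits.
                Thread.sleep(backoffMillis(baseMillis, retries));
            }
        }
    }

    public static void main(String[] args) {
        // Print the schedule for a 2000ms base and 3 retries without actually sleeping.
        for (int retry = 1; retry <= 3; retry++) {
            System.out.println("retry #" + retry + ": " + backoffMillis(2000, retry) + "ms");
        }
        System.out.println("total: " + totalBackoffMillis(2000, 3) + "ms");
    }
}
```

With a 2000ms base and 3 retries, a caller that always fails blocks for the full 14000ms unless something interrupts the sleeping thread.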

Digging more...

> [snapshot 130201 merge] Tests with sleep after minicluster shutdown fail due 
> to interrupt flag.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7778
>                 URL: https://issues.apache.org/jira/browse/HBASE-7778
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Jonathan Hsieh
>
> Something in the merge has set the interrupted flag on the main test threads 
> of TestReplicationDisableInactivePeer, TestRestartCluster, and 
> TestCatalogTrackerOnCluster.  
> These unacceptable hacks make the tests run and pass: 
> {code}
> diff --git 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTrackerOnCluster.java
>  b/hbase-server/src/test/java/or
> index f3e57d6..a8d2ef7 100644
> --- 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTrackerOnCluster.java
> +++ 
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTrackerOnCluster.java
> @@ -47,6 +47,7 @@ public class TestCatalogTrackerOnCluster {
>      // Shutdown hbase.
>      UTIL.shutdownMiniHBaseCluster();
>      // Give the various ZKWatchers some time to settle their affairs.
> +    Thread.interrupted(); // HACK clear interrupt state.
>      Thread.sleep(1000);
>  
>      // Mess with the root location in the running zk.  Set it to be nonsense.
> diff --git 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestRestartCluster.java
>  b/hbase-server/src/test/java/org/apache/h
> index 15225e1..9f7f526 100644
> --- 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestRestartCluster.java
> +++ 
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestRestartCluster.java
> @@ -108,6 +108,7 @@ public class TestRestartCluster {
>      UTIL.shutdownMiniHBaseCluster();
>  
>      LOG.info("\n\nSleeping a bit");
> +    Thread.interrupted(); // HACK clear interrupt state.
>      Thread.sleep(2000);
>  
>      LOG.info("\n\nStarting cluster the second time");
> diff --git 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationDisableInactivePeer.java
>  b/hbase-server/src/t
> index b089fbe..8162f4b 100644
> --- 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationDisableInactivePeer.java
> +++ 
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationDisableInactivePeer.java
> @@ -50,6 +50,7 @@ public class TestReplicationDisableInactivePeer extends 
> TestReplicationBase {
>      // enabling and shutdown the peer
>      admin.enablePeer("2");
>      utility2.shutdownMiniHBaseCluster();
> +    Thread.interrupted(); // HACK clear interrupted flag.
>  
>      byte[] rowkey = Bytes.toBytes("disable inactive peer");
>      Put put = new Put(rowkey);
> {code}
> On the snapshot branch and on the trunk branch before the merge, these tests 
> passed. Need to figure out how the combination caused this behavior change.
