[jira] [Commented] (HBASE-19906) TestZooKeeper Timeout

stack (JIRA) Wed, 31 Jan 2018 14:02:34 -0800

    [ https://issues.apache.org/jira/browse/HBASE-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347681#comment-16347681 ]


stack commented on HBASE-19906:
-------------------------------

{code}
2018-01-31 19:06:42,412 ERROR [Time-limited test-EventThread] zookeeper.ClientCnxn$EventThread(532): Error while calling watcher
java.lang.NullPointerException
  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.handleMetaRITOnCrashedServer(AssignmentManager.java:1898)
  at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:573)
  at org.apache.hadoop.hbase.master.RegionServerTracker.nodeDeleted(RegionServerTracker.java:131)
  at org.apache.hadoop.hbase.zookeeper.ZKWatcher.process(ZKWatcher.java:467)
  at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
{code}

The NPE above means the outstanding latch in HMaster is never released, because 
the clean-up of the running RecoverMetaProcedure does not go through. We see it 
in at least two of the TestZooKeeper timeout logs.

I think we see this issue now because of the following change to HMaster, which 
I made in the commit below:

{code}
@@ -875,9 +870,6 @@ public class HMaster extends HRegionServer implements MasterServices {
     MasterMetaBootstrap metaBootstrap = createMetaBootstrap(this, status);
     metaBootstrap.recoverMeta();

-    // check if master is shutting down because above assignMeta could return even hbase:meta isn't
-    // assigned when master is shutting down
-    if (isStopped()) return;

     //Initialize after meta as it scans meta
     if (favoredNodesManager != null) {
{code}

{code}
commit 77607e496108ff9aac3d5719fca0c12449df1637
Author: Michael Stack <[email protected]>
Date:   Wed Jan 24 21:36:57 2018 -0800

    HBASE-19840 Flakey TestMetaWithReplicas
{code}

The timeout happens during cluster shutdown. Previously we usually wouldn't get 
as far as recovering meta, because the isStopped() check would have tripped.

I am putting back the isStopped() check and adding a null check so that the 
clean-up of the procedure can go through (and release the latch in the master) 
should we get this far again.
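
To illustrate why the null check matters, here is a minimal, self-contained
sketch. It is not the HBase code: FakeMaster, metaState, recoverMetaLatch and
both callback methods are hypothetical stand-ins, and the real fix may be
shaped differently. It only shows the general pattern: if a watcher callback
throws before the clean-up runs, whoever waits on the latch is stuck; with a
null guard, the clean-up is reached and the latch is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Illustration only; none of these names are real HBase classes or methods. */
class LatchCleanupSketch {

  /** Hypothetical stand-in for the master: shutdown waits on this latch. */
  static final class FakeMaster {
    final CountDownLatch recoverMetaLatch = new CountDownLatch(1);
    // Assumed to be still null when the server-expiry callback fires early.
    Object metaState;
  }

  /** Unguarded callback: dereferences possibly-null state before the clean-up. */
  static void onServerExpiredUnguarded(FakeMaster m) {
    m.metaState.hashCode();          // NullPointerException when metaState is null
    m.recoverMetaLatch.countDown();  // clean-up; never reached if the line above throws
  }

  /** Guarded callback: skips the null state so the clean-up always runs. */
  static void onServerExpiredGuarded(FakeMaster m) {
    try {
      if (m.metaState != null) {
        m.metaState.hashCode();
      }
    } finally {
      m.recoverMetaLatch.countDown();
    }
  }

  /** Returns true if the simulated shutdown sees the latch released in time. */
  static boolean shutdownCompletes(boolean guarded) {
    FakeMaster m = new FakeMaster();
    try {
      if (guarded) {
        onServerExpiredGuarded(m);
      } else {
        onServerExpiredUnguarded(m);
      }
    } catch (NullPointerException e) {
      // Mimics an event thread that logs the watcher error and carries on.
    }
    try {
      return m.recoverMetaLatch.await(100, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("unguarded completes: " + shutdownCompletes(false));
    System.out.println("guarded completes: " + shutdownCompletes(true));
  }
}
```

In this sketch the unguarded path leaves the latch at 1, so the wait times out
(the analogue of the test timing out in shutdown), while the guarded path
releases it immediately.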



> TestZooKeeper Timeout
> ---------------------
>
>                 Key: HBASE-19906
>                 URL: https://issues.apache.org/jira/browse/HBASE-19906
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>
> TestZooKeeper is timing out causing hbase2 failures and breaking 
> HBASE-Flaky-Tests-branch2.0.0.
> -------------------------------------------------------------------------------
> Test set: org.apache.hadoop.hbase.TestZooKeeper
> -------------------------------------------------------------------------------
> Tests run: 6, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.8 s <<< FAILURE! - in org.apache.hadoop.hbase.TestZooKeeper
> org.apache.hadoop.hbase.TestZooKeeper  Time elapsed: 551.041 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
>       at org.apache.hadoop.hbase.TestZooKeeper.after(TestZooKeeper.java:103)
> org.apache.hadoop.hbase.TestZooKeeper  Time elapsed: 551.046 s  <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread NIOServerCxn.Factory:0.0.0.0/0.0.0.0:59935
> Not always though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
