[ https://issues.apache.org/jira/browse/HBASE-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347681#comment-16347681 ]
stack commented on HBASE-19906:
-------------------------------
{code}
2018-01-31 19:06:42,412 ERROR [Time-limited test-EventThread] zookeeper.ClientCnxn$EventThread(532): Error while calling watcher
java.lang.NullPointerException
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.handleMetaRITOnCrashedServer(AssignmentManager.java:1898)
    at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:573)
    at org.apache.hadoop.hbase.master.RegionServerTracker.nodeDeleted(RegionServerTracker.java:131)
    at org.apache.hadoop.hbase.zookeeper.ZKWatcher.process(ZKWatcher.java:467)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
{code}
The NPE above means that clean-up of the running RecoverMetaProcedure will not
trigger the outstanding latch in HMaster. We see it in at least two of the
TestZooKeeper timeout logs.
I think we see this issue now because of the following change to HMaster, which
I made in the commit below:
{code}
@@ -875,9 +870,6 @@ public class HMaster extends HRegionServer implements MasterServices {
 MasterMetaBootstrap metaBootstrap = createMetaBootstrap(this, status);
 metaBootstrap.recoverMeta();
- // check if master is shutting down because above assignMeta could return even hbase:meta isn't
- // assigned when master is shutting down
- if (isStopped()) return;
 //Initialize after meta as it scans meta
 if (favoredNodesManager != null) {
{code}
{code}
commit 77607e496108ff9aac3d5719fca0c12449df1637
Author: Michael Stack <[email protected]>
Date: Wed Jan 24 21:36:57 2018 -0800
HBASE-19840 Flakey TestMetaWithReplicas
{code}
The timeout happens during shutdown of the cluster. We usually wouldn't get as
far as recovering meta because the isStopped() check would have tripped first.
I am putting back the isStopped() check and adding a null check so that clean-up
of the procedure can go through (and release the latch in master) should we get
this far again.
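
A minimal, self-contained sketch of the failure mode and the null-check fix described above. It does not use the real HBase classes: `LatchCleanupSketch`, `onServerExpired`, `recoveryDone`, and the `guardNull` flag are all hypothetical names invented for illustration. The only point is that an NPE thrown in a watcher callback skips the latch release, while a guarded path lets clean-up run.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the master/procedure interaction; not HBase code. */
public class LatchCleanupSketch {
  /** Latch the (hypothetical) master waits on until recovery clean-up has run. */
  private final CountDownLatch recoveryDone = new CountDownLatch(1);
  /** When true, apply the null check that the fix adds. */
  private final boolean guardNull;

  public LatchCleanupSketch(boolean guardNull) {
    this.guardNull = guardNull;
  }

  /** Mimics a server-expiry callback; the state may be null during shutdown. */
  void onServerExpired(String metaRegionState) {
    if (guardNull && metaRegionState == null) {
      // Nothing to handle, but clean-up still has to release the latch.
      recoveryDone.countDown();
      return;
    }
    // Unguarded path: throws NullPointerException when the state is null,
    // so the countDown() below is never reached.
    if (metaRegionState.equals("OPENING")) {
      // A real handler would act on the in-transition region here.
    }
    recoveryDone.countDown();
  }

  /** Runs the callback on a separate thread and reports whether the latch was released. */
  static boolean run(boolean guardNull) {
    LatchCleanupSketch sketch = new LatchCleanupSketch(guardNull);
    Thread watcher = new Thread(() -> {
      try {
        sketch.onServerExpired(null);
      } catch (NullPointerException e) {
        // Comparable to the logged "Error while calling watcher": the event
        // thread survives, but the latch was never counted down.
      }
    });
    watcher.start();
    try {
      watcher.join();
      return sketch.recoveryDone.await(100, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("unguarded released latch: " + run(false));
    System.out.println("guarded released latch: " + run(true));
  }
}
```

In the unguarded run the waiter times out, which is the small-scale analogue of the shutdown hang; in the guarded run the latch is released and the waiter returns.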
> TestZooKeeper Timeout
> ---------------------
>
> Key: HBASE-19906
> URL: https://issues.apache.org/jira/browse/HBASE-19906
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0-beta-2
>
>
> TestZooKeeper is timing out causing hbase2 failures and breaking
> HBASE-Flaky-Tests-branch2.0.0.
> -------------------------------------------------------------------------------
> Test set: org.apache.hadoop.hbase.TestZooKeeper
> -------------------------------------------------------------------------------
> Tests run: 6, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.8 s <<<
> FAILURE! - in org.apache.hadoop.hbase.TestZooKeeper
> org.apache.hadoop.hbase.TestZooKeeper Time elapsed: 551.041 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 600
> seconds
> at org.apache.hadoop.hbase.TestZooKeeper.after(TestZooKeeper.java:103)
> org.apache.hadoop.hbase.TestZooKeeper Time elapsed: 551.046 s <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread
> NIOServerCxn.Factory:0.0.0.0/0.0.0.0:59935
> Not always though.