[
https://issues.apache.org/jira/browse/HBASE-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16348046#comment-16348046
]
stack commented on HBASE-19906:
-------------------------------
[~appy] Yeah, Master startup (Master?) is finicky. Needs a redo (See
HBASE-19831).
Here's some spew on the topic: Master-is-a-RegionServer (subclass) was hacked
in. So much of Master function happens in the startup phase before we get to
'run'. Master waits on RegionServers in startup before moving on to run. Meta
is also assigned as part of Master startup as are Namespace tables. So, Master
startup cannot complete w/o other servers reporting in and after it has done a
bunch of RPC. Master also puts up 'services' on startup, many of them. How
Services are started depends. Chores are done one way. ZK-using services
another. We started to subclass Guava Service -- which has lots of nice
facility (doing it async, reporting status, common form and start/stop...). And
so on. So, lots of Services. Some you shutdown. Some you Stop. Some go down
when cluster is set to down. Others only go down after an abort and when we are
in the Server exit sequence. Some Services need interrupt (RPC or sleeping
threads). Some have a latch so they are for sure single-stepping it, that must
be undone. Others are synchronized (interesting recent one where Master does
reportForDuty to itself but it has locked itself out when Master is supposed to
host Regions). This makes Master startup a long sequence of ops, waits on
external service.... Would be nice to redo after years of accumulation. One
nice recent redo was done by [~uagashe]... He took all the places we did region
assign -- there were at least two versions of this function -- and he put them
into a single Pv2 Procedure with the name RecoverMetaProcedure -- we should
rename it (smile), make it prettier -- but now meta assign is done one way
only... it's great -- in this patch we actually fix a bug in hbase:meta assign
... and in one place only (This Procedure doesn't work for meta region replicas
though... another hack-in). We need more of this pattern... With the Procedure
redo, we can move meta assign out of Master startup. Then we won't have to wait
on Regions to come in before we get to the 'run' phase. Once in run phase,
shutdown no longer needs special-casing -- i.e. the check for stop in startup
that this patch adds back -- or other fixups so the Master startup sequence
notices we are in cluster shutdown and goes down.
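
For illustration only, a minimal Java sketch of the 'check for stop in startup'
idea: a wait-on-RegionServers loop that also polls a stop flag, so a cluster
shutdown is noticed while startup is still blocked. This is not the HBase code
or the patch. The class StartupWaitSketch, the method waitForServers, the
Result enum and all parameters are hypothetical names made up for the sketch,
and treating a thread interrupt as a stop request is an assumption.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntSupplier;

/** Hypothetical sketch; not HBase's HMaster code. */
public class StartupWaitSketch {
    /** Outcome of the startup wait. */
    public enum Result { READY, STOPPED, TIMED_OUT }

    /**
     * Polls until at least minServers have reported in, the stop flag is set,
     * or the timeout elapses. An interrupt is treated as a stop request
     * (an assumption of this sketch).
     */
    public static Result waitForServers(IntSupplier reportedServers, int minServers,
                                        AtomicBoolean stopRequested,
                                        long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            // Check for shutdown first so a stop request is not masked by the wait.
            if (stopRequested.get()) {
                return Result.STOPPED;
            }
            if (reportedServers.getAsInt() >= minServers) {
                return Result.READY;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Result.TIMED_OUT;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and bail out of startup.
                Thread.currentThread().interrupt();
                return Result.STOPPED;
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean stop = new AtomicBoolean(false);
        // No servers ever report in; another thread requests shutdown shortly after.
        Thread stopper = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
                // Nothing to clean up in this sketch.
            }
            stop.set(true);
        });
        stopper.start();
        System.out.println(waitForServers(() -> 0, 1, stop, 5_000, 10));
    }
}
```

Without the stop check, the loop above would only exit on servers reporting in
or on timeout, which is the kind of hang-on-shutdown the special-casing guards
against.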
> TestZooKeeper Timeout
> ---------------------
>
> Key: HBASE-19906
> URL: https://issues.apache.org/jira/browse/HBASE-19906
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19906.branch-2.001.patch,
> HBASE-19906.branch-2.002.patch, HBASE-19906.branch-2.003.patch,
> HBASE-19906.branch-2.003.patch
>
>
> TestZooKeeper is timing out causing hbase2 failures and breaking
> HBASE-Flaky-Tests-branch2.0.0.
> -------------------------------------------------------------------------------
> Test set: org.apache.hadoop.hbase.TestZooKeeper
> -------------------------------------------------------------------------------
> Tests run: 6, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.8 s <<<
> FAILURE! - in org.apache.hadoop.hbase.TestZooKeeper
> org.apache.hadoop.hbase.TestZooKeeper Time elapsed: 551.041 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 600
> seconds
> at org.apache.hadoop.hbase.TestZooKeeper.after(TestZooKeeper.java:103)
> org.apache.hadoop.hbase.TestZooKeeper Time elapsed: 551.046 s <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread
> NIOServerCxn.Factory:0.0.0.0/0.0.0.0:59935
> Not always though.