[ 
https://issues.apache.org/jira/browse/HBASE-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16348046#comment-16348046
 ] 

stack commented on HBASE-19906:
-------------------------------

[~appy] Yeah, Master startup (Master?) is finicky. Needs a redo (See 
HBASE-19831).

Here's some spew on the topic: Master-is-a-RegionServer (subclass) was hacked 
in. So much of Master function happens in the startup phase before we get to 
'run'. Master waits on RegionServers in startup before moving on to run. Meta 
is also assigned as part of Master startup as are Namespace tables. So, Master 
startup cannot complete w/o other servers reporting in and after it has done a 
bunch of RPC. Master also puts up 'services' on startup, many of them. How 
Services are started varies. Chores are done one way, ZK-using services 
another. We started to subclass Guava Service -- which has lots of nice 
facilities (doing it async, reporting status, common form and start/stop...). And 
so on. So, lots of Services. Some you shutdown. Some you Stop. Some go down 
when cluster is set to down. Others only go down after an abort and when we are 
in the Server exit sequence. Some Services need interrupt (RPC or sleeping 
threads). Some have a latch, so they are for sure single-stepping it, and the 
latch must be undone. Others are synchronized (interesting recent one where 
Master does reportForDuty to itself but it has locked itself out when Master is 
supposed to host Regions). This makes Master startup a long sequence of ops and 
waits on external services.... Would be nice to redo after years of accumulation. One 
nice recent redo was done by [~uagashe]... He took all the places we did region 
assign -- there were at least two versions of this function -- and he put them 
into a single Pv2 Procedure with the name RecoverMetaProcedure -- we should 
rename it (smile), make it prettier -- but now meta assign is done one way 
only... it's great -- in this patch we actually fix a bug in hbase:meta assign 
... and in one place only (This Procedure doesn't work for meta region replicas 
though... another hack-in). We need more of this pattern... With the Procedure 
redo, we can move meta assign out of Master startup. Then we won't have to wait 
on Regions to come in before we get to the 'run' phase. Once in run phase, 
shutdown no longer needs special-casing -- i.e. the check for stop in startup 
that this patch adds back -- or other fixups so the Master startup sequence 
notices we are in cluster shutdown and that it needs to go down.
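
To illustrate the kind of "check for stop in startup" described above, here is 
a minimal sketch in plain Java. It is not HBase code and not what the patch 
does: StartupWaitSketch, waitForServers, Outcome, and every parameter are 
hypothetical names made up for illustration. It only shows the general shape 
of a startup wait that polls for servers reporting in but bails out as soon as 
a stop has been requested, rather than blocking until a test times out.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntSupplier;

// Hypothetical sketch; none of these names exist in HBase.
public class StartupWaitSketch {

    /** Why the wait ended. */
    enum Outcome { ENOUGH_SERVERS, STOPPED, TIMED_OUT }

    /**
     * Poll until at least minServers have reported in, a stop is requested,
     * or the timeout elapses. The stop flag is checked on every iteration so
     * a cluster shutdown during startup is noticed promptly.
     */
    static Outcome waitForServers(IntSupplier reportedServers, int minServers,
                                  AtomicBoolean stopRequested,
                                  long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (stopRequested.get()) {
                return Outcome.STOPPED;          // shutdown wins over everything else
            }
            if (reportedServers.getAsInt() >= minServers) {
                return Outcome.ENOUGH_SERVERS;   // safe to move on to the next phase
            }
            if (System.nanoTime() - deadline >= 0) {
                return Outcome.TIMED_OUT;
            }
            Thread.sleep(pollMs);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean stop = new AtomicBoolean(false);
        // Simulate a cluster shutdown arriving while no server has reported in.
        Thread stopper = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            stop.set(true);
        });
        stopper.start();
        Outcome outcome = waitForServers(() -> 0, 1, stop, 5_000, 10);
        stopper.join();
        System.out.println("startup wait ended: " + outcome);
    }
}
```

The point of the comment stands: a check like this is special-casing that 
would not be needed if the wait happened in the 'run' phase instead.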



> TestZooKeeper Timeout
> ---------------------
>
>                 Key: HBASE-19906
>                 URL: https://issues.apache.org/jira/browse/HBASE-19906
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: HBASE-19906.branch-2.001.patch, 
> HBASE-19906.branch-2.002.patch, HBASE-19906.branch-2.003.patch, 
> HBASE-19906.branch-2.003.patch
>
>
> TestZooKeeper is timing out causing hbase2 failures and breaking 
> HBASE-Flaky-Tests-branch2.0.0.
> -------------------------------------------------------------------------------
> Test set: org.apache.hadoop.hbase.TestZooKeeper
> -------------------------------------------------------------------------------
> Tests run: 6, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.8 s <<< 
> FAILURE! - in org.apache.hadoop.hbase.TestZooKeeper
> org.apache.hadoop.hbase.TestZooKeeper  Time elapsed: 551.041 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 600 
> seconds
>       at org.apache.hadoop.hbase.TestZooKeeper.after(TestZooKeeper.java:103)
> org.apache.hadoop.hbase.TestZooKeeper  Time elapsed: 551.046 s  <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread 
> NIOServerCxn.Factory:0.0.0.0/0.0.0.0:59935
> Not always though.


