[jira] Commented: (HBASE-3010) Can't start/stop/start... cluster using new master

HBase Review Board (JIRA) Fri, 17 Sep 2010 01:46:18 -0700

    [ https://issues.apache.org/jira/browse/HBASE-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910509#action_12910509 ]


HBase Review Board commented on HBASE-3010:
-------------------------------------------

Message from: [email protected]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/873/
-----------------------------------------------------------

Review request for hbase.


Summary
-------

Patch changes catalogtracker so it does not connect to root on start.  Instead,
it waits on a kick from zk before going after root or meta.  This change does
not address the case where a RS can get stuck trying to connect to its own RPC
server before that server is running; it sidesteps it.  (We should come back
and do something about the missing timeout when setting up the proxy, since the
hang is still possible.  I'll file an issue on it.)  Patch includes other
fixups, not all cosmetic.
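
For illustration only, the "don't look up root on start, wait on a kick" idea
might be sketched as below.  This is not the patched CatalogTracker: the class
LazyLocationTracker and its methods start, onLocationChanged and waitForRoot
are hypothetical names, and the zk watcher is assumed to be whatever calls
onLocationChanged.

```java
/**
 * Hypothetical sketch of a lazily populated location tracker.
 * Not the real CatalogTracker; all names here are illustrative.
 */
class LazyLocationTracker {
  private final Object lock = new Object();
  private String rootLocation; // stays null until a notification arrives

  /** Deliberately does no remote lookup or verification on start. */
  void start() {
    // Nothing to do: the location is filled in by onLocationChanged().
  }

  /** Assumed to be invoked by a ZooKeeper watcher when the location changes. */
  void onLocationChanged(String newLocation) {
    synchronized (lock) {
      rootLocation = newLocation;
      lock.notifyAll(); // wake up anyone blocked in waitForRoot()
    }
  }

  /** Waits up to timeoutMs for a location; returns null on timeout or interrupt. */
  String waitForRoot(long timeoutMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    synchronized (lock) {
      while (rootLocation == null) {
        long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
        if (remainingMs <= 0) {
          return null; // timed out without ever seeing a notification
        }
        try {
          lock.wait(remainingMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt flag
          return null;
        }
      }
      return rootLocation;
    }
  }
}
```

In this sketch a caller that needs root asks with a bounded wait, e.g.
waitForRoot(1000), rather than the tracker connecting eagerly in start().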

M src/test/java/org/apache/hadoop/hbase/master/TestActiveMasterManager.java
  Add a test for the case where the master comes up and the master znode
  in zk already contains our address
D src/test/java/org/apache/hadoop/hbase/master/TestMinimumServerCount.java
  Removed test of something we no longer do: waiting on an explicit number
  of regionservers to come in before we'll go ahead w/ master startup.
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  Swapped order in which we do some of the startup (Cosmetic)
M src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
  Javadoc
M src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
  Removed minimumServerCount.  Seems bad predicating master startup
  on N RS's coming in.
  Renamed method numServers as countOfRegionServers and made it protected.
  Removed other unused methods.
  Redid waitForMinServers as waitForRegionServers... where we just
  hang around until count of regionservers stabilizes.  TODO: improve
M src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
  Handle case where the current master znode has our address; in this
  case we can hurry up the expiration by deleting the znode.
M src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
  Minor formatting
M src/main/java/org/apache/hadoop/hbase/master/HMaster.java
  Renamed clusterStarter as freshClusterStartup.  Predicate this boolean
  off the count of regionservers.  If 0, then fresh cluster start.  Else
  do special handling (TODO).
  Edited HMaster constructor comments.
  Moved some code out of the HMaster constructor into the
  stackIfBackupMaster method.
  If aborting, set the stop flag.
M src/main/java/org/apache/hadoop/hbase/master/HMasterCommandLine.java
  Removed unused imports.
M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
  Make catalogtracker lazy about getting the meta location; don't do it
  on start.
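
As a rough sketch of the "hang around until count of regionservers
stabilizes" idea in the ServerManager notes above (not the code in the patch):
the class RegionServerWait, the method waitForStableCount and its parameters
intervalMs and stablePolls are all assumptions made for illustration.

```java
import java.util.function.IntSupplier;

/** Hypothetical helper; not the ServerManager code from the patch. */
class RegionServerWait {
  /**
   * Polls the supplied count every intervalMs and returns once it has been
   * unchanged for stablePolls consecutive polls (or the thread is interrupted).
   */
  static int waitForStableCount(IntSupplier count, long intervalMs, int stablePolls) {
    int last = count.getAsInt();
    int unchanged = 0;
    while (unchanged < stablePolls) {
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return last;
      }
      int now = count.getAsInt();
      if (now == last) {
        unchanged++;   // one more poll with no new check-ins
      } else {
        unchanged = 0; // count moved, start counting again
        last = now;
      }
    }
    return last;
  }
}
```

A sketch like this has no upper bound on how long it waits if servers keep
trickling in, which is one thing the TODO above could cover.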


This addresses bug hbase-3010.
    http://issues.apache.org/jira/browse/hbase-3010


Diffs
-----

  src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 2bcd5d0 
  src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java 87fe9cd 
  src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 690f78c 
  src/main/java/org/apache/hadoop/hbase/master/HMaster.java c1b80eb 
  src/main/java/org/apache/hadoop/hbase/master/HMasterCommandLine.java c675db9 
  src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java 498650f 
  src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 528bb9d 
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1ec7f4e 
  src/test/java/org/apache/hadoop/hbase/master/TestActiveMasterManager.java 030bc12 
  src/test/java/org/apache/hadoop/hbase/master/TestMinimumServerCount.java d6f2c02 

Diff: http://review.cloudera.org/r/873/diff


Testing
-------

Can now start/stop cluster repeatedly.


Thanks,

stack




> Can't start/stop/start... cluster using new master
> --------------------------------------------------
>
>                 Key: HBASE-3010
>                 URL: https://issues.apache.org/jira/browse/HBASE-3010
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> Currently you can start a small cluster the first time on TRUNK (i.e. with
> the new master), but the second time you do the startup you run into a
> couple of interesting issues:
> + The old root-region-location is still in place. It gets cleaned up later,
> but for a while on startup it does not have the 'right' address.
> + Regionserver (or a client) on startup creates a catalogtracker, a class
> that notices changes in meta tables, keeping catalog table locations up to
> date.  Starting the catalogtracker results in a check for current catalog
> locations.  As part of this process, since root-region-location "exists",
> catalogtracker tries to verify root's location by doing a noop against the
> root host; to do this it needs to do the initial rpc proxy setup.  It can
> happen that the old root address was that of the regionserver currently
> trying to initialize, so we'll be trying to connect to ourselves to verify
> root location.  But we're doing this before we've set up the rpcserver and
> handlers, so we block, and as it happens there is no timeout on proxy setup
> (Todd ran into this yesterday, I ran into it today; it's easy to
> manufacture).
> + So the regionserver can't progress.  Meantime the master can't progress
> because there are no regionservers checking in.  And you can't shut it down
> because we're not looking at the right 'stop' flag.
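
One generic way to bound a blocking setup step like the proxy setup described
in the quoted issue is to run it on a helper thread and give up after a
timeout.  The sketch below assumes nothing about the HBase RPC classes: the
class BoundedSetup and the method callWithTimeout are hypothetical, and the
Callable stands in for whatever call blocks.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper; does not use or model the HBase RPC classes. */
class BoundedSetup {
  /** Runs setup on a daemon thread; returns null if it is not done within timeoutMs. */
  static <T> T callWithTimeout(Callable<T> setup, long timeoutMs) {
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "bounded-setup");
      t.setDaemon(true); // a stuck attempt must not keep the JVM alive
      return t;
    });
    Future<T> future = executor.submit(setup);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the stuck attempt
      return null;
    } catch (InterruptedException e) {
      future.cancel(true);
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      return null;
    } catch (ExecutionException e) {
      throw new RuntimeException(e.getCause()); // setup itself failed
    } finally {
      executor.shutdownNow();
    }
  }
}
```

Cancelling only helps if the blocked call responds to interruption, so this
is a workaround sketch rather than a substitute for a real timeout in the
proxy setup itself.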

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
