[jira] [Commented] (HBASE-8519) Backup master will never come up if primary master dies during initialization

Jean-Daniel Cryans (JIRA) Fri, 10 May 2013 14:25:16 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13654859#comment-13654859
 ]


Jean-Daniel Cryans commented on HBASE-8519:
-------------------------------------------

What this jira describes is not the only failure mode. If you kill the master 
while it's shutting down the cluster and then try to restart HBase, it will hit 
the same error and go down (and the cluster won't even come up).

All you need is the master and the shutdown znodes in place when starting 
HMaster.

There's a weird disconnect here:

 # If you start the cluster on a clean ZK, you won't find either znode so you 
just start.
 # If you start the cluster and the shutdown znode exists but not the master 
znode, *you just clean it*:
{code}
    // Set the cluster as up.  If new RSs, they'll be waiting on this before
    // going ahead with their startup.
    boolean wasUp = this.clusterStatusTracker.isClusterUp();
    if (!wasUp) this.clusterStatusTracker.setClusterUp();
{code}
 # If you start the cluster and the master znode exists but not the shutdown, 
you are a backup master.
 # Finally, this jira, you have both so you assume the cluster is shutting down 
and you were meant to be a backup master.
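
The four cases above can be written down as a small decision table. This is only a sketch to make the comparison easier, not HBase code: StartupCases, Case and classify are hypothetical names, and the two booleans stand in for "the master znode exists" and "the shutdown znode exists" at the moment HMaster starts, using the same wording as the list.

```java
// Hypothetical illustration only; none of these names exist in HBase.
class StartupCases {
    enum Case {
        CLEAN_START,          // case 1: neither znode exists, just start
        CLEAN_SHUTDOWN_ZNODE, // case 2: only the shutdown znode exists, clean it and start
        BACKUP_MASTER,        // case 3: only the master znode exists, wait as a backup
        ASSUME_SHUTTING_DOWN  // case 4: both exist, treated as a cluster that is going down
    }

    // Maps the two znode observations onto the four cases in the list.
    static Case classify(boolean masterZnodeExists, boolean shutdownZnodeExists) {
        if (masterZnodeExists) {
            return shutdownZnodeExists ? Case.ASSUME_SHUTTING_DOWN : Case.BACKUP_MASTER;
        }
        return shutdownZnodeExists ? Case.CLEAN_SHUTDOWN_ZNODE : Case.CLEAN_START;
    }

    public static void main(String[] args) {
        // Print the whole table so the four combinations can be compared.
        for (boolean master : new boolean[] {false, true}) {
            for (boolean shutdown : new boolean[] {false, true}) {
                System.out.println("master=" + master + " shutdown=" + shutdown
                    + " -> " + classify(master, shutdown));
            }
        }
    }
}
```

Seen this way, cases 2 and 4 differ only in whether the master znode is present, yet one is cleaned up silently and the other stops the starting master.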

The tricky part here is that 2 and 4 are almost the same, except that you want 
to handle the case where the whole cluster is shutting down while you're 
waiting. Maybe we could check if both znodes are there before waiting and then 
recognize the situation as being a start and not a stop, but I'm sure that's 
going to screw someone at some point too...
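
To make the idea concrete, here is a rough sketch of "check both znodes before waiting". It is not a patch and it does not use the real ActiveMasterManager API: BackupWaitSketch, Decision and decide are hypothetical names, and the booleans are assumed to be gathered by the caller (one snapshot taken before blocking, one observation made after the active master's znode goes away).

```java
// Hypothetical illustration only; none of these names exist in HBase.
class BackupWaitSketch {
    enum Decision { TAKE_OVER, STOP }

    /**
     * @param bothZnodesAtStartup   snapshot taken before waiting: were the master
     *                              and shutdown znodes both already there?
     * @param looksDownAfterWaiting what the existing check concludes once the
     *                              active master is gone
     */
    static Decision decide(boolean bothZnodesAtStartup, boolean looksDownAfterWaiting) {
        if (!looksDownAfterWaiting) {
            // Nothing suggests a shutdown, so the waiting master takes over.
            return Decision.TAKE_OVER;
        }
        // The cluster looks like it went down. If both znodes were already in
        // place when we started, treat it as leftover state from a killed
        // master (a start), otherwise as a real shutdown (a stop).
        return bothZnodesAtStartup ? Decision.TAKE_OVER : Decision.STOP;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));
        System.out.println(decide(false, true));
        System.out.println(decide(false, false));
    }
}
```

The weak spot is the first line of main: if both znodes were present at startup and the cluster really is shut down while this master waits, the sketch cannot tell the difference and takes over anyway.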

Comments?
                
> Backup master will never come up if primary master dies during initialization
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-8519
>                 URL: https://issues.apache.org/jira/browse/HBASE-8519
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.94.7, 0.95.0
>            Reporter: Jerry He
>            Assignee: Jerry He
>            Priority: Minor
>             Fix For: 0.98.0
>
>
> The problem happens if the primary master dies after becoming master but before 
> it completes initialization and calls clusterStatusTracker.setClusterUp().
> The backup master will try to become the master, but will shut itself down 
> promptly because it sees 'the cluster is not up'.
> This is the backup master log:
> 2013-05-09 15:08:05,568 INFO 
> org.apache.hadoop.hbase.master.metrics.MasterMetrics: Initialized
> 2013-05-09 15:08:05,573 DEBUG org.apache.hadoop.hbase.master.HMaster: HMaster 
> started in backup mode.  Stalling until master znode is written.
> 2013-05-09 15:08:05,589 INFO 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/master 
> already exists and this is not a retry
> 2013-05-09 15:08:05,590 INFO 
> org.apache.hadoop.hbase.master.ActiveMasterManager: Adding ZNode for 
> /hbase/backup-masters/xxx.com,60000,1368137285373 in backup master directory
> 2013-05-09 15:08:05,595 INFO 
> org.apache.hadoop.hbase.master.ActiveMasterManager: Another master is the 
> active master, xxx.com,60000,1368137283107; waiting to become the next active 
> master
> 2013-05-09 15:09:45,006 DEBUG 
> org.apache.hadoop.hbase.master.ActiveMasterManager: No master available. 
> Notifying waiting threads
> 2013-05-09 15:09:45,006 INFO org.apache.hadoop.hbase.master.HMaster: Cluster 
> went down before this master became active
> 2013-05-09 15:09:45,006 DEBUG org.apache.hadoop.hbase.master.HMaster: 
> Stopping service threads
> 2013-05-09 15:09:45,006 INFO org.apache.hadoop.ipc.HBaseServer: Stopping 
> server on 60000
>  
> In ActiveMasterManager::blockUntilBecomingActiveMaster()
> {code}
>   ..
>   if (!clusterStatusTracker.isClusterUp()) {
>     this.master.stop(
>       "Cluster went down before this master became active");
>   }
>   ..
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
