[ 
https://issues.apache.org/jira/browse/HBASE-17305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752410#comment-15752410
 ] 

Esteban Gutierrez commented on HBASE-17305:
-------------------------------------------

It was a regular restart, [~enis]. I'm sure this is very rare. I think the 
culprit is this:

{code}
blockUntilBecomingActiveMaster() {
...
        this.clusterHasActiveMaster.set(true);
...
        byte[] bytes = ZKUtil.getDataAndWatch(this.watcher,
            this.watcher.znodePaths.masterAddressZNode); // <--- [0]
...
        currentMaster = ProtobufUtil.parseServerNameFrom(bytes);
...
        if (ServerName.isSameHostnameAndPort(currentMaster, this.sn)) {
            msg = ("Current master has this master's address, " +
              currentMaster + "; master was restarted? Deleting node.");
            // Hurry along the expiration of the znode.
            ZKUtil.deleteNode(this.watcher,
                this.watcher.znodePaths.masterAddressZNode); // <--- [1]

            // We may have failed to delete the znode at the previous step, but
            //  we delete the file anyway: a second attempt to delete the znode
            //  is likely to fail again.
            ZNodeClearer.deleteMyEphemeralNodeOnDisk();
          } else {
...
{code}

I think the problem lies between [0] and [1]: the old master reads the znode 
at [0] and concludes that it was restarted, and before it deletes the znode at 
[1] a backup master becomes active. As I mentioned, this happened in a very 
short time, somewhere around 85ms, but it could be less due to clock jitter.
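
To make the suspected interleaving concrete, here is a minimal sketch that replays it deterministically. It is not HBase code: the AtomicReference is a hypothetical in-memory stand-in for the master address znode, the host names are made up, and the step in the middle is only one reading of how a backup master could register between [0] and [1].

```java
import java.util.concurrent.atomic.AtomicReference;

public class MasterZnodeRaceSketch {
    // Hypothetical stand-in for the master address znode: holds "host:port" or null.
    static final AtomicReference<String> masterZnode = new AtomicReference<>();

    /** Replays the suspected interleaving and returns the znode content after step [1]. */
    static String replay() {
        // Stale node left behind by master A's previous incarnation (assumed).
        masterZnode.set("master-a:16000");

        // [0] The restarted master A reads the node and sees its own address.
        String seen = masterZnode.get();
        boolean looksLikeMine = "master-a:16000".equals(seen);

        // In the gap: the stale node goes away and backup master B registers itself.
        masterZnode.set("master-b:16000");

        // [1] A deletes unconditionally, based on what it read at [0].
        if (looksLikeMine) {
            masterZnode.set(null); // this removes B's node, not A's stale one
        }
        return masterZnode.get();
    }

    public static void main(String[] args) {
        System.out.println("znode after [1]: " + replay());
    }
}
```

In this replay the node that gets deleted at [1] belongs to the backup master, which still believes it is active, while the restarted master is free to register again.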

One solution might be to update the znode instead of deleting it when the 
active master is restarted.
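
As a rough sketch of what "update instead of delete" could look like, the example below replaces the delete with a compare-and-set update that only succeeds if the node is still the one that was read at [0]. The compare-and-set guard is an assumption on my part, not something the proposal above specifies, and the Znode class, method names and host names are hypothetical. How this would map onto actual ZooKeeper calls is left open.

```java
import java.util.concurrent.atomic.AtomicReference;

public class GuardedZnodeUpdateSketch {
    /** Hypothetical immutable snapshot of the master address znode. */
    static final class Znode {
        final String owner;
        Znode(String owner) { this.owner = owner; }
    }

    // In-memory stand-in for the znode; not a ZooKeeper client.
    static final AtomicReference<Znode> masterZnode = new AtomicReference<>();

    /** Takes over the node read at [0]; returns false if the node changed since then. */
    static boolean claimIfUnchanged(Znode seenAtStep0, String me) {
        return masterZnode.compareAndSet(seenAtStep0, new Znode(me));
    }

    /** Returns the owner after a backup master registers between the read and the update. */
    static String replayWithBackupInGap() {
        masterZnode.set(new Znode("master-a:16000")); // stale node from A's previous run
        Znode seen = masterZnode.get();               // [0] restarted A reads the node
        masterZnode.set(new Znode("master-b:16000")); // backup B becomes active in the gap
        claimIfUnchanged(seen, "master-a:16000");     // takes the place of [1]; rejected here
        return masterZnode.get().owner;
    }

    public static void main(String[] args) {
        System.out.println("owner: " + replayWithBackupInGap());
    }
}
```

Because the update is rejected when the node has changed, the backup master's registration survives and the restarted master would have to fall back to waiting as a standby.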


> Two active HBase Masters can run at the same time under certain circumstances 
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-17305
>                 URL: https://issues.apache.org/jira/browse/HBASE-17305
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.0.0
>            Reporter: Esteban Gutierrez
>            Assignee: Esteban Gutierrez
>            Priority: Critical
>
> This needs a little more investigation, but we found an edge case when the 
> active master is restarted and a standby master tries to become active. The 
> original active master was able to become the active master again just 
> before the standby master passed the point of the transition to become 
> active, so we ended up with two active masters running at the same time. 
> Assuming the clocks on both masters were accurate to the millisecond, this 
> race happened in less than 85ms. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
