[
https://issues.apache.org/jira/browse/HBASE-17305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752410#comment-15752410
]
Esteban Gutierrez commented on HBASE-17305:
-------------------------------------------
It was a regular restart, [~enis]. I'm sure this is very rare. I think the
culprit here is this:
{code}
blockUntilBecomingActiveMaster() {
...
this.clusterHasActiveMaster.set(true);
...
byte[] bytes = ZKUtil.getDataAndWatch(this.watcher,
this.watcher.znodePaths.masterAddressZNode); <--- [0]
...
currentMaster = ProtobufUtil.parseServerNameFrom(bytes);
...
if (ServerName.isSameHostnameAndPort(currentMaster, this.sn)) {
msg = ("Current master has this master's address, " +
currentMaster + "; master was restarted? Deleting node.");
// Hurry along the expiration of the znode.
ZKUtil.deleteNode(this.watcher,
this.watcher.znodePaths.masterAddressZNode); <--- [1]
// We may have failed to delete the znode at the previous step, but
// we delete the file anyway: a second attempt to delete the znode
// is likely to fail again.
ZNodeClearer.deleteMyEphemeralNodeOnDisk();
} else {
...
{code}
I think the problem lies between [0] and [1]: the old master thinks there was
a restart and, before it reaches [1], a backup master becomes active. As I
mentioned, this happened in a very short time, somewhere around 85ms, but it
could be less due to clock jitter.
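To make the suspected interleaving concrete, here is a minimal sketch that
replays it deterministically. FakeMasterZNode is a hypothetical in-memory
stand-in, not the ZooKeeper or HBase API, and the server names are made up.
It also assumes the old znode goes away (for example through session expiry)
before the backup master creates its own, which the log above does not confirm.

```java
import java.util.Optional;

final class MasterRaceDemo {

    /** Hypothetical in-memory stand-in for the master address znode. */
    static final class FakeMasterZNode {
        private String owner; // null means the znode does not exist

        synchronized Optional<String> read() {
            return Optional.ofNullable(owner);
        }

        /** Creates the znode if absent; true means the caller became active. */
        synchronized boolean createIfAbsent(String serverName) {
            if (owner != null) {
                return false;
            }
            owner = serverName;
            return true;
        }

        /** Unconditional delete, like step [1]: ignores who owns the znode now. */
        synchronized void delete() {
            owner = null;
        }
    }

    /** Replays the interleaving; returns how many masters believe they are active. */
    static int replay() {
        FakeMasterZNode znode = new FakeMasterZNode();
        String restarted = "host-a:16000"; // same host and port across the restart
        String backup = "host-b:16000";
        int active = 0;

        znode.createIfAbsent(restarted);       // stale znode left from before the restart
        Optional<String> seen = znode.read();  // [0] restarted master sees its own address

        znode.delete();                        // assumption: the old znode goes away here
        if (znode.createIfAbsent(backup)) {    // backup master becomes active
            active++;
        }

        if (seen.isPresent() && seen.get().equals(restarted)) {
            znode.delete();                    // [1] acts on the stale read, removes backup's znode
        }
        if (znode.createIfAbsent(restarted)) { // restarted master becomes active as well
            active++;
        }
        return active;
    }

    public static void main(String[] args) {
        System.out.println("masters that consider themselves active: " + replay());
    }
}
```

In this model the delete at [1] is not tied to what was read at [0], so it
removes whatever znode exists at that moment.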
One solution might be to update the znode instead of deleting it when there
is a restart of the active master.
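One possible reading of that idea, sketched under assumptions: the restarted
master replaces the znode data only if the znode is unchanged since the read
at [0]. VersionedZNode is a hypothetical in-memory model, not the ZooKeeper or
HBase API, and the server names and start codes are invented. Whether a real
znode offers an equivalent guard across delete and recreate is not established
here.

```java
final class UpdateInsteadOfDeleteDemo {

    /** Hypothetical in-memory znode with a monotonically increasing data version. */
    static final class VersionedZNode {
        private String data;
        private int version;

        VersionedZNode(String initialData) {
            this.data = initialData;
        }

        synchronized String data() {
            return data;
        }

        synchronized int version() {
            return version;
        }

        /** Unconditional write, standing in for another master taking over. */
        synchronized void overwrite(String newData) {
            data = newData;
            version++;
        }

        /** Writes only if the znode is unchanged since expectedVersion was read. */
        synchronized boolean updateIfUnchanged(int expectedVersion, String newData) {
            if (version != expectedVersion) {
                return false;
            }
            data = newData;
            version++;
            return true;
        }
    }

    /** Returns the znode data after a backup master wins the window between [0] and [1]. */
    static String replayWithTakeover() {
        VersionedZNode znode = new VersionedZNode("host-a:16000:1");
        int seenVersion = znode.version();   // version observed at [0]
        znode.overwrite("host-b:16000:7");   // backup master takes over in the window
        // Conditional update in place of the delete at [1]; it fails because the
        // version moved, so the restarted master does not claim the znode.
        znode.updateIfUnchanged(seenVersion, "host-a:16000:2");
        return znode.data();
    }

    public static void main(String[] args) {
        System.out.println("znode owner after the race: " + replayWithTakeover());
    }
}
```

Because the update is conditional, the losing master gets an explicit failure
it can react to instead of silently removing the other master's znode.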
> Two active HBase Masters can run at the same time under certain circumstances
> ------------------------------------------------------------------------------
>
> Key: HBASE-17305
> URL: https://issues.apache.org/jira/browse/HBASE-17305
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 2.0.0
> Reporter: Esteban Gutierrez
> Assignee: Esteban Gutierrez
> Priority: Critical
>
> This needs a little more investigation, but we found a rare edge case: when
> the active master is restarted and a standby master tries to become active,
> the original active master was able to become the active master again
> just before the standby master passed the point of the transition to
> become active, and we ended up with two active masters running at the same
> time. Assuming the clocks on both masters were accurate to milliseconds, this
> race happened in less than 85ms.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)