[
https://issues.apache.org/jira/browse/ARTEMIS-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072473#comment-17072473
]
Sebastian Lövdahl commented on ARTEMIS-2568:
--------------------------------------------
[[email protected]] Thanks for the tips, much appreciated!
It looks like I might have jumped to conclusions a bit too fast this time. Although
we have observed something like what the ticket describes in the past, this time it
doesn't seem to be what happened. It looks like there was a brief loss of
network connectivity in the cluster, and without any of the nodes restarting,
both ended up as masters. That sounds a bit more alarming to me, to be honest,
but this is likely not the correct ticket to have that discussion on.
> Race condition between failover processing and master restart can cause split
> brain
> -----------------------------------------------------------------------------------
>
> Key: ARTEMIS-2568
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2568
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.10.1
> Reporter: Bob Mitchell
> Priority: Major
>
> We have seen split brain in the following sequence of events when using
> replicating backups with failback:
> # Master fails or is shut down
> # Backup detects failure and starts to failover
> # Master is restarted before Backup becomes "live"
> # Its check for a "duplicate" server fails because backup is not live yet
> # Master and backup both become live.
> At the very least, we would like to see the window in which this can occur
> reduced, possibly by having the backup check again whether the master is
> available just before going live. It might also be necessary to have the
> master check for a duplicate server as a last step before going live.
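
For illustration only, the "check again just before going live" idea from the
description could be sketched roughly as below. This is not Artemis code:
FailoverSketch, Role, activateBackup and the masterIsLive probe are all
hypothetical names, and the sketch assumes the backup has some way to ask
whether the master is live. A check-then-act step like this only narrows the
window; it does not close it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; none of these names come from the Artemis code base.
public class FailoverSketch {

    enum Role { BACKUP, LIVE }

    /**
     * Runs the backup's (possibly slow) activation work, then probes the
     * master one more time immediately before switching to LIVE.
     */
    static Role activateBackup(Runnable slowActivation, BooleanSupplier masterIsLive) {
        slowActivation.run();              // the master may restart while this runs
        if (masterIsLive.getAsBoolean()) { // last-moment re-check
            return Role.BACKUP;            // master is back: abandon the failover
        }
        return Role.LIVE;
    }

    public static void main(String[] args) {
        AtomicBoolean masterLive = new AtomicBoolean(false);
        // Simulate step 3 of the description: the master comes back
        // while the backup is still activating.
        Role result = activateBackup(() -> masterLive.set(true), masterLive::get);
        System.out.println(result); // prints BACKUP
    }
}
```

The same last-step probe could be applied on the master side against the
backup, which is the second suggestion in the description.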
--
This message was sent by Atlassian Jira
(v8.3.4#803005)