[
https://issues.apache.org/jira/browse/ARTEMIS-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072441#comment-17072441
]
Francesco Nigro commented on ARTEMIS-2568:
------------------------------------------
A possible workaround would be to introduce an artificial delay in the script
that restarts the master. The delay should account for:
* vote-retries and vote-retry-wait of the slave, which must elapse before it can
start the failover, e.g. by default 12 * 5 seconds = 60 seconds, plus
(12 * RTT * quorum size)
* time to start: this varies, but you can measure a maximum, which depends on
the speed of your disk and the heap size
If the restart script pauses for the sum of these times before starting the
master, you should be safe.
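As a rough illustration of the arithmetic above, here is a minimal Python
sketch that adds up the pieces of the delay. The function name, the parameter
names and the sample RTT, quorum size and start time are hypothetical; only the
defaults of 12 retries and 5 seconds per retry come from the comment. Measure
the other values in your own environment.

```python
def restart_delay_ms(vote_retries=12, vote_retry_wait_ms=5000,
                     rtt_ms=0, quorum_size=1, max_start_time_ms=0):
    """Return how long (in ms) the restart script should pause.

    All names here are illustrative; they are not Artemis APIs.
    """
    # Time the slave spends retrying the quorum vote before failing over.
    voting_ms = vote_retries * vote_retry_wait_ms
    # Network cost of the votes: one round trip per retry per quorum member.
    network_ms = vote_retries * rtt_ms * quorum_size
    # Worst-case start time measured on your hardware (disk speed, heap size).
    return voting_ms + network_ms + max_start_time_ms


if __name__ == "__main__":
    # Assumed sample values: 50 ms RTT, quorum of 3, 30 s measured start time.
    delay = restart_delay_ms(rtt_ms=50, quorum_size=3, max_start_time_ms=30000)
    print(f"pause for {delay / 1000:.1f} seconds before restarting the master")
```

The restart script would then sleep for at least this long before starting the
master. This only narrows the window for the race; it does not remove it.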
> Race condition between failover processing and master restart can cause split
> brain
> -----------------------------------------------------------------------------------
>
> Key: ARTEMIS-2568
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2568
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.10.1
> Reporter: Bob Mitchell
> Priority: Major
>
> We have seen split brain in the following sequence of events when using
> replicating backups with failback:
> # Master fails or is shut down
> # Backup detects failure and starts to failover
> # Master is restarted before Backup becomes "live"
> # Its check for a "duplicate" server fails because the backup is not live yet
> # Master and backup both become live.
> At the very least, we would like to see the window in which this can occur
> reduced, possibly by having the backup check again for the master to be
> available just before going live. It might also be necessary to have the
> master check for a duplicate server as a last step before going live as well.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)