[jira] [Comment Edited] (ARTEMIS-2568) Race condition between failover processing and master restart can cause split brain

Francesco Nigro (Jira) Tue, 31 Mar 2020 23:52:39 -0700


    [ https://issues.apache.org/jira/browse/ARTEMIS-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072441#comment-17072441 ]


Francesco Nigro edited comment on ARTEMIS-2568 at 4/1/20, 6:51 AM:
-------------------------------------------------------------------

A possible workaround would be to introduce an artificial delay in the script 
that restarts the master. The delay should account for:
* vote-retries and vote-retry-wait of the slave, i.e. the time it needs before 
it can start the failover, e.g. by default 12 * 5 seconds = 60 seconds + 
(12 RTT * quorum size)
* time to start: this varies, but you can use the maximum elapsed time you 
could experience, which depends on the speed of your disk and the heap size
If you make the script that restarts the master pause for the sum of these 
times, you should be safe.
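
As a rough illustration of the arithmetic above, here is a minimal sketch in 
Python. The function and parameter names are hypothetical and are not part of 
Artemis. The defaults of 12 retries and 5 seconds come from the figures quoted 
above; the RTT, quorum size and maximum start time are placeholders that you 
would have to measure in your own environment.

```python
def restart_delay_seconds(vote_retries=12,
                          vote_retry_wait_s=5.0,
                          rtt_s=0.0,
                          quorum_size=1,
                          max_start_time_s=0.0):
    """Estimate how long a restart script should pause before starting the master.

    All names here are illustrative assumptions, not Artemis APIs.
    """
    # Time the slave spends waiting between quorum vote attempts
    # (12 * 5 seconds = 60 seconds with the defaults quoted above).
    vote_wait = vote_retries * vote_retry_wait_s
    # Network cost of the votes: one RTT per retry per quorum member.
    vote_network = vote_retries * rtt_s * quorum_size
    # Worst-case time to start, which depends on disk speed and heap size.
    return vote_wait + vote_network + max_start_time_s
```

A restart script could then pause for the returned value, for example with 
time.sleep(restart_delay_seconds(rtt_s=0.01, quorum_size=3, 
max_start_time_s=30)), before starting the master. The values in that call are 
made up for the example.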


was (Author: [email protected]):
A possible workaround would be (in the script that restart the master) to 
introduce an artificial delay; such delay should consider:
* vote-retries and vote-retry-wait of the slave before being able to start the 
failover eg by default 12 * 5 seconds = 60 seconds + (12 RTT* quorum size)
* time to start: variable time,  but can measure a max time  depending by the 
speed of your disk and the heap size 
If you make the script the restart master to pause by the sum of the previous 
amount of times you should be safe 

> Race condition between failover processing and master restart can cause split 
> brain
> -----------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-2568
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2568
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.10.1
>            Reporter: Bob Mitchell
>            Priority: Major
>
> We have seen split brain in the following sequence of events when using 
> replicating backups with failback:
>  # Master fails or is shut down
>  # Backup detects failure and starts to failover
>  # Master is restarted before Backup becomes "live"
>  # Its check for a "duplicate" server fails because backup is not live yet
>  # Master and backup both become live.
> At the very least, we would like to see the window in which this can occur 
> reduced, possibly by having the backup check again for the master to be 
> available just before going live.  It might also be necessary to have the 
> master check for a duplicate server as a last step before going live as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
