[jira] [Comment Edited] (ARTEMIS-2690) Intermittent network failure caused live and replica to both be live

Jira Thu, 27 Aug 2020 23:56:30 -0700


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186313#comment-17186313
 ]


Sebastian Lövdahl edited comment on ARTEMIS-2690 at 8/28/20, 6:55 AM:
----------------------------------------------------------------------

Unfortunately, it looks like explicitly setting `quorum-size` didn't help. We 
have still seen the same behaviour twice, with both the live and replica ending 
up live at the same time. One interesting thing though is that stopping 
and starting the node that is supposed to be the replica (the one that 
erroneously became live) does NOT solve the problem. It starts in live mode 
again, so it seems that somehow it doesn't notice that the actual live node 
is running. Does anyone have any ideas? I'm starting to feel kind of lost 
here.
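
For reference, explicitly setting `quorum-size` on the replica would look 
roughly like the sketch below. Everything except `quorum-size` is copied from 
the replica1 policy in the issue description; the value `2` is only an assumed 
placeholder and not the value from the actual setup.

```xml
<ha-policy>
  <replication>
    <slave>
      <cluster-name>my-cluster</cluster-name>
      <group-name>group1</group-name>
      <allow-failback>true</allow-failback>
      <vote-on-replication-failure>true</vote-on-replication-failure>
      <!-- Assumed placeholder value, not the one from the real setup. -->
      <quorum-size>2</quorum-size>
    </slave>
  </replication>
</ha-policy>
```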

One more thing I noticed through `netstat`: all the nodes still seem to have 
their connections to each other open. So it sounds like it's just some kind of 
internal state in the node that's supposed to be live that causes this.


was (Author: slovdahl):
Unfortunately, it looks like explicitly setting `quorum-size` didn't help. We 
have still seen the same behaviour twice, with both the live and replica ending 
up live at the same time. One interesting thing though is that stopping 
and starting the node that is supposed to be the replica (the one that 
erroneously became live) does NOT solve the problem. It starts in live mode 
again, so it seems that somehow it doesn't notice that the actual live node 
is running. Does anyone have any ideas? I'm starting to feel kind of lost 
here.

> Intermittent network failure caused live and replica to both be live
> --------------------------------------------------------------------
>
>                 Key: ARTEMIS-2690
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2690
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.11.0
>         Environment: Artemis 2.11.0, Ubuntu 18.04
>            Reporter: Sebastian Lövdahl
>            Priority: Major
>         Attachments: live1-artemis.log, live1-broker.xml, live2-artemis.log, 
> live2-broker.xml, live3-artemis.log, live3-broker.xml, replica1-artemis.log, 
> replica1-broker.xml
>
>
> An intermittent network failure caused both the live and replica to be live. 
> Both happily accepted incoming connections until the node that was supposed 
> to be the replica was manually shut down. Log files from all 4 nodes are 
> attached. The {{replica1}} node happened to have some TRACE logging enabled 
> as well.
>  
> As far as I have understood the documentation, the setup should be safe from 
> a split brain point of view. The live2 and live3 nodes intentionally don't 
> have any replicas at the moment. Complete {{broker.xml}} files are attached, 
> but for reference, this is the {{ha-policy}}:
> live1:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>       <cluster-name>my-cluster</cluster-name>
>       <group-name>group1</group-name>
>       <check-for-live-server>true</check-for-live-server>
>       <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}
> replica1:
> {code:xml}
> <ha-policy>
>   <replication>
>     <slave>
>       <cluster-name>my-cluster</cluster-name>
>       <group-name>group1</group-name>
>       <allow-failback>true</allow-failback>
>       <vote-on-replication-failure>true</vote-on-replication-failure>
>     </slave>
>   </replication>
> </ha-policy>
> {code}
> live2:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>       <cluster-name>my-cluster</cluster-name>
>       <group-name>group2</group-name>
>       <check-for-live-server>true</check-for-live-server>
>       <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}
> live3:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>       <cluster-name>my-cluster</cluster-name>
>       <group-name>group2</group-name>
>       <check-for-live-server>true</check-for-live-server>
>       <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
