[jira] [Commented] (ARTEMIS-2713) Master failback can trigger a useless quorum vote on slave failover

ASF subversion and git services (Jira) Wed, 15 Apr 2020 18:57:25 -0700


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084475#comment-17084475
 ]


ASF subversion and git services commented on ARTEMIS-2713:
----------------------------------------------------------

Commit 5a829ff1f4b33540e6ddb6f1f0e99ed2a8204f4e in activemq-artemis's branch 
refs/heads/master from Francesco Nigro
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=5a829ff ]

ARTEMIS-2713 Master failback can trigger a useless quorum vote on slave failover


> Master failback can trigger a useless quorum vote on slave failover
> -------------------------------------------------------------------
>
>                 Key: ARTEMIS-2713
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2713
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.11.0
>            Reporter: Francesco Nigro
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A shared-nothing replicated master-slave pair using check-for-live-server on
> the master and allow-failback on the slave can trigger one or more useless
> quorum votes during master restart.
> Whether the issue occurs depends on the timing of some messages exchanged
> between the pair. A slave restarting as a backup performs these
> operations:
> # send STOP_CALLED asynchronously on the connection with the master that is
> used to send the replica files (call it the replication connection)
> # close all connections with the master except the replication connection
> (sending a DISCONNECT on each connection being closed)
> # send FAIL_OVER asynchronously on the replication connection (waiting 5
> seconds before giving up and moving on)
> # close the replication connection
> The master could receive the DISCONNECT before STOP_CALLED (because they
> travel on different connections!) and so believe that the slave isn't going
> down intentionally: this makes it fire up to vote-retries quorum votes.
> Such a quorum vote should (on the happy path) complete "quickly" and
> positively, letting the master fail over anyway, because the slave has
> already moved on and (ideally) the other brokers have had "enough time" to
> update their topologies too.
> Although performing an additional quorum vote isn't a bad thing per se, it
> could create an unnecessarily long time window spent waiting for the
> observing cluster members to update their topologies, slowing down an
> operation that is instead supposed to complete quickly (on the happy path).
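
The ordering problem described above can be sketched with a small,
deterministic model. This is only an illustration and not Artemis code: the
packet names STOP_CALLED, DISCONNECT and FAIL_OVER come from the description,
while the class FailbackRaceSketch, the MasterView type and the votesFor
helper are hypothetical names invented for this sketch. It assumes the master
starts a vote whenever it sees a DISCONNECT without having first seen
STOP_CALLED, which is a simplification of the behaviour reported in the issue.

```java
import java.util.Arrays;
import java.util.List;

public class FailbackRaceSketch {

    // Packet names are taken from the issue text; everything else is hypothetical.
    enum Packet { STOP_CALLED, DISCONNECT, FAIL_OVER }

    // Hypothetical stand-in for the master's bookkeeping, not Artemis code.
    static final class MasterView {
        boolean stopAnnounced = false;
        int quorumVotesStarted = 0;

        void onPacket(Packet packet) {
            switch (packet) {
                case STOP_CALLED:
                    // The slave announced an intentional shutdown.
                    stopAnnounced = true;
                    break;
                case DISCONNECT:
                    // Assumption: an unannounced disconnect looks like a failure,
                    // so the master starts a quorum vote.
                    if (!stopAnnounced) {
                        quorumVotesStarted++;
                    }
                    break;
                default:
                    // FAIL_OVER does not affect the vote count in this model.
                    break;
            }
        }
    }

    // Replays packets in the order the master happens to receive them.
    static int votesFor(List<Packet> arrivalOrder) {
        MasterView master = new MasterView();
        for (Packet packet : arrivalOrder) {
            master.onPacket(packet);
        }
        return master.quorumVotesStarted;
    }

    public static void main(String[] args) {
        // Arrival order matches send order: no vote is started.
        List<Packet> inOrder =
            Arrays.asList(Packet.STOP_CALLED, Packet.DISCONNECT, Packet.FAIL_OVER);
        // DISCONNECT (sent on another connection) overtakes STOP_CALLED.
        List<Packet> raced =
            Arrays.asList(Packet.DISCONNECT, Packet.STOP_CALLED, Packet.FAIL_OVER);

        System.out.println("in order: " + votesFor(inOrder) + " vote(s)");
        System.out.println("raced:    " + votesFor(raced) + " vote(s)");
    }
}
```

Because the packets travel on different connections, nothing guarantees the
first arrival order; the sketch only shows why the second order leads to a
vote that the first order avoids.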



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
