[ https://issues.apache.org/jira/browse/ARTEMIS-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084475#comment-17084475 ]
ASF subversion and git services commented on ARTEMIS-2713:
----------------------------------------------------------
Commit 5a829ff1f4b33540e6ddb6f1f0e99ed2a8204f4e in activemq-artemis's branch
refs/heads/master from Francesco Nigro
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=5a829ff ]
ARTEMIS-2713 Master failback can trigger a useless quorum vote on slave failover
> Master failback can trigger a useless quorum vote on slave failover
> -------------------------------------------------------------------
>
> Key: ARTEMIS-2713
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2713
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.11.0
> Reporter: Francesco Nigro
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> A shared-nothing replicated master-slave pair using check-for-live-server on
> master and allow-failback on slave can trigger one or more useless quorum
> votes during master restart.
> Whether the issue happens depends on the timing with which some messages are
> exchanged between the pair: the slave, restarting as a backup, performs these
> operations:
> # async send STOP_CALLED on the connection with master used to send the
> replica files (i.e. the replication connection)
> # close all the connections with master except the replication connection
> (sending a DISCONNECT on each closing one)
> # async send FAIL_OVER on the replication connection (waiting 5 seconds
> before giving up and moving on)
> # close the replication connection
> The master could receive the DISCONNECT before STOP_CALLED (because they
> travel on different connections) and so believe that the slave isn't going
> down intentionally: this makes it fire a vote-retries quorum vote.
> Such a quorum vote (in the happy path) should "quickly" complete positively,
> letting master fail over anyway, because the slave has already moved on
> and (ideally) the other brokers have had "enough time" to update their
> topologies too.
> Although performing an additional quorum vote isn't a bad thing per se, it
> could create an unnecessarily long time window spent waiting for the
> observing cluster to update its topology, slowing down an operation that is
> supposed to complete quickly (on the happy path).
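
The ordering problem in the description above can be illustrated with a small, self-contained Java sketch. It is not Artemis code: MasterModel, votesFor and the Packet enum are hypothetical names introduced only for illustration, and the rule "a DISCONNECT seen before STOP_CALLED starts a quorum vote" is a simplification of the behaviour the issue describes. Because STOP_CALLED and DISCONNECT travel on different connections, the sketch simply replays the two possible arrival orders on the master side.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only; none of these types exist in ActiveMQ Artemis.
final class FailbackRaceSketch {

    // Packet names taken from the issue description.
    enum Packet { STOP_CALLED, DISCONNECT, FAIL_OVER }

    // Assumed, simplified view of how the master reacts to incoming packets.
    static final class MasterModel {
        private boolean stopCalledSeen = false;
        private int quorumVotes = 0;

        void onPacket(Packet packet) {
            switch (packet) {
                case STOP_CALLED:
                    // The slave announced an intentional stop.
                    stopCalledSeen = true;
                    break;
                case DISCONNECT:
                    // Without a prior STOP_CALLED the disconnect looks unintentional,
                    // so the model starts a quorum vote.
                    if (!stopCalledSeen) {
                        quorumVotes++;
                    }
                    break;
                case FAIL_OVER:
                    // Not relevant to the vote decision in this sketch.
                    break;
            }
        }

        int quorumVotes() {
            return quorumVotes;
        }
    }

    // Replays packets in the order the master happens to observe them.
    static int votesFor(List<Packet> arrivalOrder) {
        MasterModel master = new MasterModel();
        for (Packet packet : arrivalOrder) {
            master.onPacket(packet);
        }
        return master.quorumVotes();
    }

    public static void main(String[] args) {
        // Order in which the slave sends the packets.
        List<Packet> sentOrder =
            Arrays.asList(Packet.STOP_CALLED, Packet.DISCONNECT, Packet.FAIL_OVER);
        // Possible arrival order when DISCONNECT overtakes STOP_CALLED.
        List<Packet> racedOrder =
            Arrays.asList(Packet.DISCONNECT, Packet.STOP_CALLED, Packet.FAIL_OVER);

        System.out.println("votes, sent order:  " + votesFor(sentOrder));
        System.out.println("votes, raced order: " + votesFor(racedOrder));
    }
}
```

In this model the sent order yields no vote, while the raced order yields one, which matches the useless vote the issue reports. The sketch says nothing about how the actual fix in the commit works.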
--
This message was sent by Atlassian Jira
(v8.3.4#803005)