[ 
https://issues.apache.org/jira/browse/ARTEMIS-2713?focusedWorklogId=423186&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-423186
 ]

ASF GitHub Bot logged work on ARTEMIS-2713:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Apr/20 01:56
            Start Date: 16/Apr/20 01:56
    Worklog Time Spent: 10m 
      Work Description: asfgit commented on pull request #3084: ARTEMIS-2713 
Master failback can trigger a useless quorum vote on slave failover
URL: https://github.com/apache/activemq-artemis/pull/3084
 
 
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 423186)
    Time Spent: 0.5h  (was: 20m)

> Master failback can trigger a useless quorum vote on slave failover
> -------------------------------------------------------------------
>
>                 Key: ARTEMIS-2713
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2713
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.11.0
>            Reporter: Francesco Nigro
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A shared-nothing replicated master-slave pair that uses check-for-live-server 
> on the master and allow-failback on the slave can trigger one or more useless 
> quorum votes during master restart.
> Whether the issue occurs depends on the timing of some messages exchanged 
> between the pair. A slave restarting as a backup performs these operations:
> # async send STOP_CALLED on the connection with the master used to send the 
> replica files (i.e. the replication connection)
> # close all the connections with the master except the replication connection 
> (sending a DISCONNECT on each connection being closed)
> # async send FAIL_OVER on the replication connection (waiting 5 seconds 
> before giving up and moving on)
> # close the replication connection
> The master could receive the DISCONNECT before STOP_CALLED (because they 
> travel on different connections!) and so believe that the slave isn't going 
> down intentionally: this makes it fire a quorum vote (subject to vote-retries). 
> Such a quorum vote (on the happy path) should complete "quickly" and 
> positively, letting the master fail over anyway, because the slave has 
> already moved on and (ideally) the other brokers have had "enough time" to 
> update their topologies too.
> Although performing an additional quorum vote isn't a bad thing per se, it 
> can create an unnecessarily long time window spent waiting for the observing 
> cluster members to update their topologies, slowing down an operation that 
> is instead supposed to complete quickly (on the happy path).
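
The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Artemis code: the packet names STOP_CALLED, DISCONNECT and FAIL_OVER come from the issue text, while the class FailbackRaceSketch, the MasterView type and the votesFor helper are hypothetical names invented for illustration. The model assumes the master starts a quorum vote whenever it sees a DISCONNECT without having seen STOP_CALLED first, and replays the two possible arrival orders.

```java
import java.util.Arrays;
import java.util.List;

public class FailbackRaceSketch {

    // Packet names are taken from the issue text; everything else is hypothetical.
    enum Packet { STOP_CALLED, DISCONNECT, FAIL_OVER }

    /** Toy model of what the master knows about its backup (not Artemis code). */
    static final class MasterView {
        private boolean backupStoppingIntentionally;
        private int quorumVotesStarted;

        void onPacket(Packet packet) {
            switch (packet) {
                case STOP_CALLED:
                    // The backup announced an intentional stop.
                    backupStoppingIntentionally = true;
                    break;
                case DISCONNECT:
                    // Assumption: an unannounced disconnect looks like a failure,
                    // so the master asks the quorum before acting.
                    if (!backupStoppingIntentionally) {
                        quorumVotesStarted++;
                    }
                    break;
                case FAIL_OVER:
                    // Not relevant to the vote decision in this toy model.
                    break;
            }
        }

        int quorumVotesStarted() {
            return quorumVotesStarted;
        }
    }

    /** Replays packets in the order the master happens to receive them. */
    static int votesFor(List<Packet> arrivalOrder) {
        MasterView master = new MasterView();
        for (Packet packet : arrivalOrder) {
            master.onPacket(packet);
        }
        return master.quorumVotesStarted();
    }

    public static void main(String[] args) {
        // Order in which the slave sends the packets.
        List<Packet> sendOrder =
            Arrays.asList(Packet.STOP_CALLED, Packet.DISCONNECT, Packet.FAIL_OVER);
        // DISCONNECT travels on a different connection, so it may overtake STOP_CALLED.
        List<Packet> racedOrder =
            Arrays.asList(Packet.DISCONNECT, Packet.STOP_CALLED, Packet.FAIL_OVER);

        System.out.println("votes when arrival matches send order: " + votesFor(sendOrder));
        System.out.println("votes when DISCONNECT arrives first:   " + votesFor(racedOrder));
    }
}
```

In this model the send order triggers no vote, while the raced order triggers one, even though the slave sent the packets in the same sequence both times. Ordering is only preserved per connection, which is why the replication connection alone cannot protect the master from an early DISCONNECT.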



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
