[ 
https://issues.apache.org/jira/browse/ARTEMIS-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francesco Nigro updated ARTEMIS-2713:
-------------------------------------
    Description: 
A shared-nothing replicated master-slave pair using check-for-live-server on 
the master and allow-failback on the slave can trigger one or more useless 
quorum votes during master restart.
The issue depends on the timing with which some messages are exchanged 
between the pair: specifically, the slave, while restarting as a backup, 
performs these operations:
# asynchronously send STOP_CALLED on the connection with the master used to 
send the replica files (i.e. the replication connection)
# close all connections with the master except the replication connection 
(sending a DISCONNECT on each connection being closed)
# asynchronously send FAIL_OVER on the replication connection (waiting 5 
seconds before giving up and moving on)
# close the replication connection

The master, in order to restart as live, could receive the DISCONNECT before 
STOP_CALLED and believe that the slave isn't going down intentionally: this 
makes it fire up to vote-retries quorum votes. 
Such a quorum vote (in the happy path) will soon complete positively, allowing 
the master to fail over anyway, because the slave has already moved on and 
(ideally) the other brokers have had "enough time" to update their topologies 
too.
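
The ordering problem described above can be sketched in Java as follows. This 
is only an illustrative model, not the broker's code: the class 
FailbackRaceSketch, the Packet enum and triggersQuorumVote are hypothetical 
names, and the only assumption taken from the description is that a DISCONNECT 
seen before STOP_CALLED makes the master start a quorum vote. The two packets 
travel on different connections and STOP_CALLED is sent asynchronously, so 
either arrival order is possible.

```java
import java.util.List;

/** Hypothetical model of how the master interprets the slave going away. */
public class FailbackRaceSketch {

    /** Packet names come from the issue; the enum itself is illustrative. */
    public enum Packet { STOP_CALLED, DISCONNECT, FAIL_OVER }

    /**
     * Returns true if the master would start a quorum vote, i.e. if a
     * DISCONNECT is observed before any STOP_CALLED.
     */
    public static boolean triggersQuorumVote(List<Packet> arrivalOrder) {
        for (Packet p : arrivalOrder) {
            if (p == Packet.STOP_CALLED) {
                return false; // intentional shutdown is known first: no vote
            }
            if (p == Packet.DISCONNECT) {
                return true; // looks like an unexpected failure: vote
            }
        }
        return false; // neither packet has been seen
    }

    public static void main(String[] args) {
        // Order in which the slave issues the packets: no vote expected.
        System.out.println(triggersQuorumVote(
            List.of(Packet.STOP_CALLED, Packet.DISCONNECT, Packet.FAIL_OVER))); // false
        // DISCONNECT overtakes the async STOP_CALLED: useless vote.
        System.out.println(triggersQuorumVote(
            List.of(Packet.DISCONNECT, Packet.STOP_CALLED, Packet.FAIL_OVER))); // true
    }
}
```

The second call models the problematic case: nothing is actually wrong with 
the slave, yet the arrival order alone is enough to start a vote.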

Although performing an additional quorum vote isn't a bad thing per se, it 
could create an unnecessarily long time window spent waiting for the observing 
cluster to update its topologies, slowing down an operation that is supposed 
to complete quickly (in the happy path).

  was:
A shared nothing replicated master-slave pair using check-for-live-server on 
master and allow-failback on slave can trigger a (single or several) useless 
quorum vote during master restart.
The issue can happen depending on the timing by which some messages are 
exchanged between the pair: specifically the slave, while restarting as a 
backup, will perform these operations:
# async send STOP_CALLED on the connection with master used to send the replica 
files (ie let's call it replication connection)
# close all the connections with master, but the replication connection 
(sending a DISCONNECT to the closing ones)
# async send FAIL_OVER on the replication connection (waiting 5 seconds before 
giving up and move on)
# close the replication connection

The master, in order to restart as live, could receive the DISCONNECT before 
STOP_CALLED, believing that the slave isn't going down intentionally: this will 
make it to fire vote-retries quorum vote. 
Such quorum vote (in the happy path) will be positives and will make master to 
fail-over anyway, because the slave is already moved on and (ideally) the other 
brokers have "enough time" to update their topologies too.

Although performing an additional quorum vote isn't a bad thing per-se, it 
could create an unnecessary long time window to await the observing cluster to 
update their topologies, slowing down an operation that is supposed instead to 
be completed quickly (in the happy path).


> Master failback can trigger a useless quorum vote on slave failover
> -------------------------------------------------------------------
>
>                 Key: ARTEMIS-2713
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2713
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.11.0
>            Reporter: Francesco Nigro
>            Priority: Major
>
> A shared-nothing replicated master-slave pair using check-for-live-server on 
> the master and allow-failback on the slave can trigger one or more useless 
> quorum votes during master restart.
> The issue depends on the timing with which some messages are exchanged 
> between the pair: specifically, the slave, while restarting as a backup, 
> performs these operations:
> # asynchronously send STOP_CALLED on the connection with the master used to 
> send the replica files (i.e. the replication connection)
> # close all connections with the master except the replication connection 
> (sending a DISCONNECT on each connection being closed)
> # asynchronously send FAIL_OVER on the replication connection (waiting 5 
> seconds before giving up and moving on)
> # close the replication connection
> The master, in order to restart as live, could receive the DISCONNECT before 
> STOP_CALLED and believe that the slave isn't going down intentionally: this 
> makes it fire up to vote-retries quorum votes. 
> Such a quorum vote (in the happy path) will soon complete positively, 
> allowing the master to fail over anyway, because the slave has already moved 
> on and (ideally) the other brokers have had "enough time" to update their 
> topologies too.
> Although performing an additional quorum vote isn't a bad thing per se, it 
> could create an unnecessarily long time window spent waiting for the 
> observing cluster to update its topologies, slowing down an operation that 
> is supposed to complete quickly (in the happy path).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
