[ 
https://issues.apache.org/jira/browse/ARTEMIS-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362898#comment-17362898
 ] 

Francesco Nigro edited comment on ARTEMIS-2716 at 6/14/21, 6:18 PM:
--------------------------------------------------------------------

I'm going to:
 # *remove the initial loop on primary start*: a primary start should either 
succeed or fail (with errors), which is key for admin purposes. Admins are 
supposed to check the broker/machine state before restarting, so it's not just 
an automated operation: it needs to be supervised
 # *deprecate/document allow-failback*: allow-failback == false turns a 
failing-back primary into a backup that can just error out on failover errors.

The latter decision has been made to enforce what the primary role is meant to 
be: a live candidate and an occasional/temporary backup, ready to fail back 
ASAP. allow-failback == false should be used to perform a manual failback (by 
restarting the backup acting as live), but NOT to let a primary become a 
natural-born long-living backup: this is because the primary itself has been 
(re)started by a manual intervention after a previous outage.

Right now no automatic primary restart is safe, due to the journal 
misalignment issue explained in 
https://issues.apache.org/jira/browse/ARTEMIS-3340.

It's perfectly fine to fail fast on a failure during the failback process, 
given that it should be an all-or-nothing admin operation.

For a failure during a failover (because the backup acting as live has rejected 
the initial failback request), it's still uncertain which behaviour should 
follow:
 * a natural-born backup would just search for other lives to pair/sync with, 
because, as a backup, it's supposed to help other brokers
 * a primary is probably fine to just stop, because there is no point in 
restarting as primary (risking becoming live with a misaligned journal) or in 
behaving like a natural-born long-living backup, i.e. the behaviour mentioned 
above

 This change is debatable and we can open a discussion on the PR about it.

An alternative behavior could be:
 * a failed scheduled failback can still fail fast, i.e. the primary just goes 
down, to avoid a long-running supervised admin restart
 * a failed proper failover (i.e. unscheduled failback) can make the primary 
retry searching for any live to help again, i.e. acting as a backup, but still 
aware of its node ID
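
To make the comparison easier to discuss on the PR, the two options above can be 
written down as a small decision table. This is only a sketch of the reasoning: 
the class, enums and method names below are hypothetical and are not part of the 
Artemis code base.

```java
// Hypothetical sketch: none of these types exist in Artemis.
public class FailureHandlingSketch {

    /** Role the broker was configured with. */
    enum Role { PRIMARY, NATURAL_BORN_BACKUP }

    /** What was in progress when the failure happened. */
    enum Phase { SCHEDULED_FAILBACK, UNSCHEDULED_FAILOVER }

    /** What the broker does after the failure. */
    enum Action { STOP, SEARCH_FOR_LIVE }

    /** Proposed behaviour: a primary always stops, whatever the phase. */
    static Action proposed(Role role, Phase phase) {
        if (role == Role.NATURAL_BORN_BACKUP) {
            // A natural-born backup keeps looking for lives to pair/sync with.
            return Action.SEARCH_FOR_LIVE;
        }
        return Action.STOP;
    }

    /** Alternative: a primary stops only when a scheduled failback fails. */
    static Action alternative(Role role, Phase phase) {
        if (role == Role.NATURAL_BORN_BACKUP) {
            return Action.SEARCH_FOR_LIVE;
        }
        // On an unscheduled failover failure the primary retries as a backup,
        // still aware of its node ID.
        return phase == Phase.SCHEDULED_FAILBACK ? Action.STOP : Action.SEARCH_FOR_LIVE;
    }

    public static void main(String[] args) {
        for (Role role : Role.values()) {
            for (Phase phase : Phase.values()) {
                System.out.println(role + " / " + phase
                        + ": proposed=" + proposed(role, phase)
                        + ", alternative=" + alternative(role, phase));
            }
        }
    }
}
```

The two options differ only for a primary hitting a failure during an 
unscheduled failover; the phase is ignored for a natural-born backup, since it 
never fails back.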

 

Just a tech note: the latter behavior could be achieved by restarting the 
primary as a primary again, but one that waits forever for a live server with 
the same node ID to appear: this is to NOT lose the failback information needed 
during the activation as failing-back backup of primary (it's an implementation 
detail of how activation works on Artemis server start).



> Implements pluggable Quorum Vote
> --------------------------------
>
>                 Key: ARTEMIS-2716
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2716
>             Project: ActiveMQ Artemis
>          Issue Type: New Feature
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Major
>         Attachments: backup.png, primary.png
>
>          Time Spent: 16h
>  Remaining Estimate: 0h
>
> This task aims to deliver a new Quorum Vote mechanism for Artemis with the 
> objectives:
> # to make it pluggable
> # to cleanly separate the election phase and the cluster member states
> # to simplify most common setups in both amount of configuration and 
> requirements (eg "witness" nodes could be implemented to support single 
> master-slave pairs)
> Post-actions to help people adopt it, which need to be thought through upfront:
> # a clean upgrade path for current HA replication users
> # deprecate or integrate the current HA replication into the new version



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
