[
https://issues.apache.org/jira/browse/ARTEMIS-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362898#comment-17362898
]
Francesco Nigro edited comment on ARTEMIS-2716 at 6/14/21, 5:40 PM:
--------------------------------------------------------------------
I'm going to:
# *remove the initial loop on primary start*: a primary start should either
succeed or fail (with errors), which is key for admin purposes. Admins are
supposed to check the broker/machine state before restarting, so this is not
just an automated operation: it needs to be supervised.
# *deprecate/document allow-failback*: allow-failback == false turns a
failing-back primary into a backup that can only error out on failover errors.
In classic replication, a failing-back master forgets its NodeID if any error
happens on failover and restarts as an empty backup. On broker restart, it gets
a different NodeID and becomes live.
The latter decision has been made to enforce what the primary role is meant to
be: mostly a live candidate and an occasional/temporary backup, ready to
failback ASAP.
If a failure happens during the failback process, it's perfectly fine to fail
fast, given that failback should be an all-or-nothing admin operation.
For a failure during a proper failover (because the backup acting as live has
rejected the initial failback request), it is still uncertain which behaviour
should follow:
* a natural-born backup just searches for other lives to pair/sync with
* for a primary, it is probably fine to just stop, because there is no point in
restarting as primary (and risking becoming live with a misaligned journal) or
in behaving like a natural-born backup, i.e. the behaviour mentioned above
This change is debatable and we can open a discussion on the PR about it.
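To make the proposed behaviours easier to discuss, here is a minimal sketch of the decision table described above. All names (FailurePolicySketch, Role, Phase, Action, onFailure) are hypothetical and are not Artemis APIs; the primary-on-failover branch encodes the debatable "just stop" option, not a settled decision.

```java
import java.util.Objects;

/** Illustrative only: hypothetical names, not part of the Artemis code base. */
final class FailurePolicySketch {

    /** Role the broker was configured with. */
    enum Role { PRIMARY, BACKUP }

    /** Phase in which the error happened. */
    enum Phase { PRIMARY_START, FAILBACK, FAILOVER }

    /** What the broker does after the error. */
    enum Action { FAIL_WITH_ERROR, STOP, SEARCH_FOR_OTHER_LIVE }

    static Action onFailure(Role role, Phase phase) {
        Objects.requireNonNull(role, "role");
        Objects.requireNonNull(phase, "phase");
        switch (phase) {
            case PRIMARY_START:
            case FAILBACK:
                // Only a primary starts as primary or fails back.
                if (role != Role.PRIMARY) {
                    throw new IllegalArgumentException(phase + " applies to a primary only");
                }
                // Supervised, all-or-nothing admin operations: fail fast, no retry loop.
                return Action.FAIL_WITH_ERROR;
            case FAILOVER:
                // A natural-born backup looks for another live to pair/sync with.
                // A primary just stops (assumption: this is the debatable option).
                return role == Role.BACKUP ? Action.SEARCH_FOR_OTHER_LIVE : Action.STOP;
            default:
                throw new IllegalArgumentException("unknown phase: " + phase);
        }
    }
}
```

For example, onFailure(Role.PRIMARY, Phase.FAILOVER) yields STOP, while the same failure on a natural-born backup yields SEARCH_FOR_OTHER_LIVE.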
> Implements pluggable Quorum Vote
> --------------------------------
>
> Key: ARTEMIS-2716
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2716
> Project: ActiveMQ Artemis
> Issue Type: New Feature
> Reporter: Francesco Nigro
> Assignee: Francesco Nigro
> Priority: Major
> Attachments: backup.png, primary.png
>
> Time Spent: 16h
> Remaining Estimate: 0h
>
> This task aims to deliver a new Quorum Vote mechanism for Artemis with the
> following objectives:
> # to make it pluggable
> # to cleanly separate the election phase and the cluster member states
> # to simplify the most common setups in terms of both amount of configuration
> and requirements (e.g. "witness" nodes could be implemented to support single
> master-slave pairs)
> Post-actions to help people adopt it, which need to be thought through upfront:
> # a clean upgrade path for current HA replication users
> # deprecate or integrate the current HA replication into the new version