[
https://issues.apache.org/jira/browse/ARTEMIS-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362898#comment-17362898
]
Francesco Nigro edited comment on ARTEMIS-2716 at 6/14/21, 6:18 PM:
--------------------------------------------------------------------
I'm going to:
# *remove the initial loop on primary start*: a primary start should either
succeed or fail (with errors), which is key for admin purposes. Admins are
supposed to check the broker/machine state before restarting, so it is not just
an automated operation: it needs to be supervised
# *deprecate/document allow-failback*: allow-failback == false turns a
failing-back primary into a backup that can just error out on failover errors.
This decision has been made to enforce what the primary role is meant to
be: a live candidate and an occasional/temporary backup, ready to fail back
ASAP. allow-failback == false should be used to perform a manual
failback (by restarting the backup acting as live), but NOT to let a primary
become a natural-born long-living backup: this is because the primary itself
has been (re)started by a manual intervention after a previous outage.
Right now no automatic primary restart is safe, due to the journal
misalignment issue explained in
https://issues.apache.org/jira/browse/ARTEMIS-3340.
It is perfectly fine for a failure during the failback process to fail fast,
given that failback should be an all-or-nothing admin operation.
For a failure during a failover (because the backup acting as live has rejected
the initial failback request), it is still uncertain which behaviour should
follow:
* a natural-born backup would just search for other lives to pair/sync with,
because, as a backup, it's supposed to help other brokers
* a primary is probably fine to just stop, because there is no point in
restarting as primary (risking becoming live with a misaligned journal) or in
behaving like a natural-born long-living backup, i.e. the behaviour mentioned
above
This change is debatable and we can open a discussion on the PR about it.
An alternative behavior could be:
* a failed scheduled failback can still fail fast, i.e. the primary just goes
down, to avoid dragging out a long-running supervised admin restart
* a failed proper failover (i.e. an unscheduled failback) can make the primary
retry searching for any live to help again, i.e. act as a backup, while still
being aware of its node ID
Just a tech note: the latter behavior could be achieved by restarting the
primary as a primary again, but one that waits forever for a live server with
the same node ID to appear: this is to NOT lose the failback information needed
during the activation of the primary as a failing-back backup (it's an
implementation detail of how activation works on Artemis server start).
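To make the two options above easier to compare on the PR, here is a minimal
Java sketch of them as a decision table. It is not Artemis code: the class,
enum and method names are all hypothetical, and it assumes a natural-born
backup always goes back to searching for a live, as described above.

```java
// Hypothetical sketch only: none of these names exist in Artemis.
public class FailurePolicySketch {

    // Role the broker was configured with.
    enum Role { NATURAL_BORN_BACKUP, PRIMARY }

    // Kind of failure hit while acting as a backup.
    enum Failure { SCHEDULED_FAILBACK, UNSCHEDULED_FAILOVER }

    // What the broker does next.
    enum Action { STOP, SEARCH_FOR_ANY_LIVE }

    // Proposed behaviour: a primary always stops (fail-fast),
    // a natural-born backup keeps looking for a live to pair with.
    static Action proposed(Role role, Failure failure) {
        if (role == Role.NATURAL_BORN_BACKUP) {
            return Action.SEARCH_FOR_ANY_LIVE;
        }
        return Action.STOP;
    }

    // Alternative behaviour: a primary stops only on a failed scheduled
    // failback; on a failed unscheduled failover it retries as a backup.
    static Action alternative(Role role, Failure failure) {
        if (role == Role.NATURAL_BORN_BACKUP) {
            return Action.SEARCH_FOR_ANY_LIVE;
        }
        return failure == Failure.SCHEDULED_FAILBACK
                ? Action.STOP
                : Action.SEARCH_FOR_ANY_LIVE;
    }

    public static void main(String[] args) {
        // Print the full decision table for both options.
        for (Role role : Role.values()) {
            for (Failure failure : Failure.values()) {
                System.out.printf("%s / %s: proposed=%s, alternative=%s%n",
                        role, failure,
                        proposed(role, failure),
                        alternative(role, failure));
            }
        }
    }
}
```

The only cell where the two options differ is a primary hitting a failed
unscheduled failover, which is the case that is still open for discussion.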
> Implements pluggable Quorum Vote
> --------------------------------
>
> Key: ARTEMIS-2716
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2716
> Project: ActiveMQ Artemis
> Issue Type: New Feature
> Reporter: Francesco Nigro
> Assignee: Francesco Nigro
> Priority: Major
> Attachments: backup.png, primary.png
>
> Time Spent: 16h
> Remaining Estimate: 0h
>
> This task aims to deliver a new Quorum Vote mechanism for Artemis with the
> objectives:
> # to make it pluggable
> # to cleanly separate the election phase and the cluster member states
> # to simplify the most common setups, in both amount of configuration and
> requirements (e.g. "witness" nodes could be implemented to support single
> master-slave pairs)
> Post-actions to help people adopt it, which need to be thought through upfront:
> # a clean upgrade path for current HA replication users
> # deprecate or integrate the current HA replication into the new version
--
This message was sent by Atlassian Jira
(v8.3.4#803005)