[
https://issues.apache.org/jira/browse/ARTEMIS-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Francesco Nigro updated ARTEMIS-3340:
-------------------------------------
Description:
Shared-nothing replication can cause journal misalignment despite no
split-brain events.
There are several ways this can happen.
Below are some scenarios that do not involve network partitions or drastic
outages.
Scenario 1:
# Master/Primary starts as live, clients connect to it
# Backup becomes an in-sync replica
# User stops live and backup fails over
# *Backup serves clients alone, modifying its journal*
# User stops backup
# User starts master/primary: it becomes live with a journal misaligned to the
most up-to-date one, i.e. the one on the stopped backup
Scenario 2:
# Master/Primary starts as live, clients connect to it
# Backup becomes an in-sync replica
# Connection glitch between backup -> live
# Backup starts trying to fail over (for {{vote-retries * vote-retry-wait}}
milliseconds)
# *Live serves clients alone, modifying its journal*
# User stops live
# Backup succeeds in failing over: it becomes live with a journal misaligned to
the most up-to-date one, i.e. the one on the stopped live
The main cause of this issue is that we allow *a single broker to serve
clients*, despite it being configured with HA, which generates the journal
misalignment.
The quorum service (classic or pluggable) just takes care of the mutually
exclusive presence of a broker for the live role (vs a NodeID), without
considering live role ordering based on the most up-to-date journal.
A possible solution is to use
https://issues.apache.org/jira/browse/ARTEMIS-2716 and use a quorum "logical
timestamp" marking the age of the journal in order to force live to always have
the most up-to-date journal. It means that the same
In case of quorum service restart/outage, the admin must use a
command/configuration to let a broker ignore the age of its journal and just
force it to start.
In addition, new journal CLI commands should be implemented to inspect the age
of a (local) broker journal or query/force the quorum journal version, for
troubleshooting purposes.
It's very important to capture every possible event that causes the journal age
to increase, e.g.:
# live broker sends its journal file to a not-yet-in-sync replica backup, along
with its "journal age"
# backup is now ready to fail over at any moment
# a network partition happens
# backup tries to become live for vote-retries times
# live detects the replication disconnection but is "lucky": it can reach the
quorum and continues serving clients
# live increments the age of its journal
# an outage causes live to die
# network partition is restored
# backup detects that the journal age no longer matches its own journal: it
stops trying to become live
The key parts related to journal age/version are:
* only the live broker can change the journal version (with a monotonic
increment)
* every breaking point event must cause the journal age/version to change, e.g.
starting as live, losing its backup, etc.
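The two rules above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: {{QuorumJournalVersion}}, {{VersionedBroker}} and all their methods are hypothetical names, not Artemis APIs, and an in-memory {{AtomicLong}} stands in for the counter that the quorum service would hold.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the journal version held by the quorum service. */
final class QuorumJournalVersion {
    private final AtomicLong version = new AtomicLong();

    long get() {
        return version.get();
    }

    /** Monotonic increment: the version only ever moves forward. */
    long increment() {
        return version.incrementAndGet();
    }
}

/** Hypothetical broker model that remembers the version of its local journal. */
final class VersionedBroker {
    private final QuorumJournalVersion quorum;
    private long localVersion;
    private boolean live;

    VersionedBroker(QuorumJournalVersion quorum) {
        this.quorum = quorum;
        this.localVersion = quorum.get();
    }

    /** A broker may take the live role only if its journal matches the quorum version. */
    boolean canBecomeLive() {
        return localVersion == quorum.get();
    }

    void becomeLive() {
        if (!canBecomeLive()) {
            throw new IllegalStateException("journal does not match the quorum version");
        }
        live = true;
        onBreakingEvent(); // starting as live is itself a breaking point event
    }

    /** Called on breaking point events such as starting as live or losing the backup. */
    void onBreakingEvent() {
        if (!live) {
            throw new IllegalStateException("only the live broker can change the journal version");
        }
        localVersion = quorum.increment();
    }

    /** The backup receives the journal together with its "journal age". */
    void syncFrom(VersionedBroker liveBroker) {
        localVersion = liveBroker.localVersion;
    }
}
```

In this model, a backup that synced before the live lost its replica holds an older version than the quorum one, so {{canBecomeLive()}} returns false for it once the partition heals.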
Re the RI implementation using Apache Curator, this could use a separate
[DistributedAtomicLong|https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/atomic/DistributedAtomicLong.html]
to manage the journal version.
Although tempting, it's not a good idea to use the data field on
{{InterProcessSemaphoreV2}}, because:
* there's no API to query it if no lease is acquired yet (or created)
* we also need to "age" the journal independently from the lock
acquisition/release process, e.g. a live that drops its replica needs to
increment the journal version
Although tempting, it's not a good idea to just use the last alive broker
connector identity instead of a journal version, because of the ABA problem
(see https://en.wikipedia.org/wiki/ABA_problem).
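A small Java sketch of why the identity marker is weaker than a version, under assumptions: {{QuorumMarkers}} and its methods are hypothetical, and a third observer broker is introduced purely for illustration. The observer syncs while broker A is live, misses a period in which B and then A again take the live role, and finally compares what it remembers with the quorum state.

```java
import java.util.Objects;

/** Hypothetical quorum state keeping both candidate markers side by side. */
final class QuorumMarkers {
    private String lastLiveId;   // candidate 1: identity of the last live broker
    private long journalVersion; // candidate 2: monotonic journal version

    /** Records that a broker took the live role (a breaking point event). */
    void liveStarted(String brokerId) {
        lastLiveId = brokerId;
        journalVersion++;
    }

    String lastLiveId() {
        return lastLiveId;
    }

    long journalVersion() {
        return journalVersion;
    }

    /** Identity check: cannot tell "A, then B, then A again" apart from "A all along". */
    boolean looksAlignedById(String observedId) {
        return Objects.equals(observedId, lastLiveId);
    }

    /** Version check: any intermediate live change leaves a visible trace. */
    boolean looksAlignedByVersion(long observedVersion) {
        return observedVersion == journalVersion;
    }
}
```

After the sequence A, B, A the identity check still reports the observer as aligned, while the version check does not, because the counter moved from 1 to 3.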
This versioning mechanism isn't without drawbacks: quorum journal versioning
requires storing a local copy of the version so that the broker can query and
compare it with the quorum one on restart; having 2 separate, non-atomic
operations means that there must be a way to reconcile/fix it in case of
misalignments.
This could be done with admin operations.
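The restart comparison could look roughly like the Java sketch below. Everything here is an assumption made for illustration: the class {{JournalVersionCheck}}, the outcome names, and in particular the reading that a local version ahead of the quorum one can only come from an interrupted update or a reset quorum service.

```java
/** Hypothetical restart-time comparison of the local and quorum journal versions. */
final class JournalVersionCheck {

    enum Outcome {
        /** Local copy and quorum agree: the journal is the most up-to-date one. */
        ALIGNED,
        /**
         * Local copy is behind: either the journal is really old, or the broker
         * died after updating the quorum and before persisting the local copy.
         */
        BEHIND_QUORUM,
        /** Local copy is ahead: interrupted update or reset quorum, admin must reconcile. */
        AHEAD_OF_QUORUM
    }

    static Outcome compare(long localVersion, long quorumVersion) {
        if (localVersion == quorumVersion) {
            return Outcome.ALIGNED;
        }
        return localVersion < quorumVersion ? Outcome.BEHIND_QUORUM : Outcome.AHEAD_OF_QUORUM;
    }
}
```

Only {{ALIGNED}} is safe to act on automatically in this sketch; the other two outcomes are where the admin operations mentioned above would come in.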
The versioning changes the way roles behave, but they still retain their key
characteristics:
- backup can start as live if it has the most up-to-date journal and no other
live is around; if not, it can just rotate its journal and be available to sync
with a live
- primary tries to fail back to any existing live with the most up-to-date
journal or awaits it, without becoming live if its own journal is old
This would ensure that if both brokers are up and running and backup allows
primary to fail back, the primary eventually becomes live and backup replicates
it.
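The role behaviour described above can be summarised as a small decision function. This Java sketch is hypothetical: {{StartupDecision}}, its enums and the two boolean inputs are illustrative names only, and it assumes that any live found in the cluster already holds the most up-to-date journal.

```java
/** Hypothetical startup decision for a broker under journal versioning. */
final class StartupDecision {

    enum Role { PRIMARY, BACKUP }

    enum Action {
        BECOME_LIVE,
        ROTATE_JOURNAL_AND_SYNC,
        FAILBACK_TO_LIVE,
        AWAIT_UP_TO_DATE_LIVE
    }

    /**
     * @param role            configured role of the starting broker
     * @param journalUpToDate whether the local journal version matches the quorum one
     * @param liveAround      whether another broker currently holds the live role
     */
    static Action decide(Role role, boolean journalUpToDate, boolean liveAround) {
        if (liveAround) {
            // A live exists: primary tries to fail back to it, backup replicates it.
            return role == Role.PRIMARY ? Action.FAILBACK_TO_LIVE : Action.ROTATE_JOURNAL_AND_SYNC;
        }
        if (journalUpToDate) {
            // No live around and the journal is the most up-to-date one.
            return Action.BECOME_LIVE;
        }
        // Old journal and no live: never serve clients alone from stale data.
        return role == Role.PRIMARY ? Action.AWAIT_UP_TO_DATE_LIVE : Action.ROTATE_JOURNAL_AND_SYNC;
    }
}
```

For example, a primary restarted with an old journal while no live is around ends up in {{AWAIT_UP_TO_DATE_LIVE}} rather than serving clients, which is the case Scenario 1 gets wrong today.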
was:
Shared-nothing replication can cause journal misalignment despite no
split-brain events.
Scenario without network partitions/outages:
# Master/Primary start as live, clients connect to it
# Backup become an in-sync replica
# User stop live and backup failover to it
# Backup serve clients, modifying its journal
# User stop backup
# User start master/primary: it become live with a journal misaligned to the
most up-to-date one ie on the stopped backup
The main cause of this issue is because we allow a single broker to serve
clients, despite configured with HA, generating the journal misalignment.
The quorum service (classic or pluggable) just take care of mutual exclusive
presence of broker for the live role (vs a NodeID), without considering live
role ordering ie last live alive: there's the need of a distributed agreement
on such (total) order.
A possible solution is to leverage on
https://issues.apache.org/jira/browse/ARTEMIS-2716 and store a "logical
timestamp" that mark the age of the journal in order to allow the one with the
most up-to-date one to become a proper live.
It means that in case of quorum service restart/outage, admin must use
command/configuration to let a broker to ignore the age of its journal and just
force it to start.
In addition must be exposed some new journal CLI commands to inspect the age
of a broker journal, for troubleshooting reasons.
It's very important to capture every possible event that cause the journal age
to increase
eg
# live broker send its journal file to a not yet in sync replica backup, along
with its "journal age"
# backup is now ready to failover in any moment
# a network partition happen
# backup try to become live for vote-retries times
# live detect replication disconnection but is "lucky" that can reach the
quorum and continue serving clients
# live increment the age of its journal
# an outage cause live to die
# network partition is restored
# backup detect that journal age is no longer matching its own journal: it
stop trying to become live
The key parts related to journal age/version are:
* only who's live can change journal version (with a monotonic increment)
* every breaking point event must cause journal age/version to change eg
starting as live, losing its backup, etc etc
Re the RI implementation using Apache Curator, this could use a separate
[DistributedAtomicLong|https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/atomic/DistributedAtomicLong.html]
to manage the journal version.
Although tempting, it's not a good idea to use the data field on
{{InterProcessSemaphoreV2}}, because:
* there's no API to query it if no lease is acquired yet (or created)
* we more need to "age" the journal independently from the lock
acquisition/release process eg a live that drop its replica need to increment
the journal version
Although tempting, it's not a good idea to just use the last alive broker
connector identity instead of a journal version, because of the ABA problem
(see https://en.wikipedia.org/wiki/ABA_problem).
This versioning mechanism isn't without drawbacks: quorum journal versioning
requires to store a local copy of the version in order to allow the broker to
query and compare it with the quorum one on restart; having 2 separate and not
atomic operations means that there must be a way to reconcile/fix it in case of
misalignments.
This could be done with admin operations.
The versioning change the way roles behave, but they still retain theirs key
characteristics:
- backup can start as live in case of most up to date journal and no other live
around, but if not, can just rotate journal and be available to sync with a live
- primary try to failback to any existing live with the most up to date journal
or await it, without becoming live in case of old journal
This would ensure that If both broker are up and running and backup allow
primary to failback, the primary eventually become live and backup replicates
it.
> Replicated Journal quorum-based logical timestamp
> -------------------------------------------------
>
> Key: ARTEMIS-3340
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3340
> Project: ActiveMQ Artemis
> Issue Type: Improvement
> Reporter: Francesco Nigro
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)