[ 
https://issues.apache.org/jira/browse/ARTEMIS-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francesco Nigro updated ARTEMIS-3340:
-------------------------------------
    Description: 
Shared-nothing replication can cause journal misalignment even when no 
split-brain event occurs.

Scenario without network partitions/outages:
 # Master/Primary starts as live and clients connect to it
 # Backup becomes an in-sync replica
 # User stops the live and the backup fails over, taking the live role
 # Backup serves clients, modifying its journal
 # User stops the backup
 # User starts master/primary: it becomes live with a journal that is 
misaligned with the most up-to-date one, ie the one on the stopped backup

The main cause of this issue is that we allow a single broker to serve 
clients, although it is configured for HA, which generates the journal 
misalignment.
 The quorum service (classic or pluggable) only takes care of the mutually 
exclusive presence of a broker in the live role (for a given NodeID), without 
considering live role ordering, ie which broker was the last one alive as 
live: hence the need for a distributed agreement on such a (total) order.

A possible solution is to build on 
https://issues.apache.org/jira/browse/ARTEMIS-2716 and store a "logical 
timestamp" that marks the age of the journal, so that only the broker with 
the most up-to-date journal can become a proper live.

It means that, in case of a quorum service restart/outage, the admin must use 
a command/configuration to let a broker ignore the age of its journal and 
just force it to start.
 In addition, some new journal CLI commands must be exposed to inspect the age 
of a broker journal, for troubleshooting purposes.
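
The activation rule described above can be sketched as follows. This is only 
an illustration under stated assumptions: {{ActivationCheck}}, {{Decision}} and 
{{decide}} are hypothetical names that do not exist in Artemis, and an empty 
{{coordinatedAge}} is assumed to model a quorum service that lost its state or 
cannot be reached.

```java
import java.util.OptionalLong;

// Hypothetical sketch: none of these names exist in Artemis.
final class ActivationCheck {

    enum Decision { BECOME_LIVE, REFUSE, FORCED_LIVE }

    /**
     * @param localAge       age stored alongside this broker's journal
     * @param coordinatedAge age agreed through the quorum service; empty when the
     *                       quorum service lost its state or cannot be reached (assumption)
     * @param forceStart     admin override that ignores the journal age
     */
    static Decision decide(long localAge, OptionalLong coordinatedAge, boolean forceStart) {
        if (forceStart) {
            // The admin explicitly asked to ignore the journal age.
            return Decision.FORCED_LIVE;
        }
        if (!coordinatedAge.isPresent()) {
            // Nothing to compare against: refuse rather than risk serving a stale journal.
            return Decision.REFUSE;
        }
        // Only the broker holding the most up-to-date journal may become live.
        return localAge == coordinatedAge.getAsLong() ? Decision.BECOME_LIVE : Decision.REFUSE;
    }

    public static void main(String[] args) {
        System.out.println(decide(3, OptionalLong.of(3), false));   // BECOME_LIVE
        System.out.println(decide(2, OptionalLong.of(3), false));   // REFUSE
        System.out.println(decide(2, OptionalLong.empty(), true));  // FORCED_LIVE
    }
}
```

Refusing when no agreed age is available is a conservative choice made for this 
sketch; it is the case where the forced start is expected to be used.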

It's very important to capture every possible event that causes the journal 
age to increase,
 eg
 # live broker sends its journal file to a not yet in-sync replica backup, 
along with its "journal age"
 # backup is now ready to fail over at any moment
 # a network partition happens
 # backup tries to become live for vote-retries times
 # live detects the replication disconnection but is "lucky": it can reach the 
quorum and continues serving clients
 # live increments the age of its journal
 # an outage causes live to die
 # the network partition is healed
 # backup detects that the journal age no longer matches its own journal: it 
stops trying to become live

The key parts related to journal age/version are:
 * only the live broker can change the journal version (with a monotonic 
increment)
 * every breaking point event must cause the journal age/version to change, eg 
starting as live, losing its backup, etc
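
The two rules above, applied to the partition scenario, could look like the 
minimal sketch below. {{JournalAgeRules}}, {{Broker}}, {{onBreakingPoint}}, 
{{syncFrom}} and {{mayBecomeLive}} are hypothetical names used only for 
illustration, and the "latest agreed version" is passed in by hand where a real 
broker would read it from the quorum service.

```java
// Hypothetical sketch: these types are illustrative and are not Artemis classes.
final class JournalAgeRules {

    enum Role { LIVE, BACKUP }

    static final class Broker {
        private final Role role;
        private long journalVersion;

        Broker(Role role, long journalVersion) {
            this.role = role;
            this.journalVersion = journalVersion;
        }

        long journalVersion() {
            return journalVersion;
        }

        // Called on every breaking point event, eg starting as live or losing the backup.
        void onBreakingPoint(String event) {
            if (role != Role.LIVE) {
                throw new IllegalStateException("only the live can change the journal version: " + event);
            }
            journalVersion++; // monotonic: the version never goes back
        }

        // Initial replication: the backup receives the journal together with its version.
        void syncFrom(Broker live) {
            journalVersion = live.journalVersion;
        }

        // A backup may take over only if its journal matches the latest agreed version.
        boolean mayBecomeLive(long latestAgreedVersion) {
            return journalVersion == latestAgreedVersion;
        }
    }

    public static void main(String[] args) {
        Broker live = new Broker(Role.LIVE, 0);
        Broker backup = new Broker(Role.BACKUP, 0);

        live.onBreakingPoint("started as live");  // live version becomes 1
        backup.syncFrom(live);                    // backup is in sync at version 1
        live.onBreakingPoint("lost its backup");  // partition: live version becomes 2

        // Stand-in for the value the backup would read from the quorum service.
        long latestAgreed = live.journalVersion();
        System.out.println(backup.mayBecomeLive(latestAgreed)); // prints false
    }
}
```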

 

Regarding the RI implementation based on Apache Curator, it could use a 
separate 
[DistributedAtomicLong|https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/atomic/DistributedAtomicLong.html]
  to manage the journal version.
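
As a rough idea of how such a counter could be used, the sketch below ages the 
journal with a compare-and-set from the local version to the next one. It uses 
an in-memory {{AtomicLong}} as a stand-in so that it runs without ZooKeeper; 
Curator's {{DistributedAtomicLong}} offers comparable {{get}}, {{increment}} and 
{{compareAndSet}} operations, but the mapping is an assumption and not tested 
here. {{CoordinatedJournalVersion}} and {{age}} are hypothetical names.

```java
import java.util.OptionalLong;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: an in-memory stand-in for a counter kept in the quorum service.
final class CoordinatedJournalVersion {

    private final AtomicLong shared = new AtomicLong();

    long current() {
        return shared.get();
    }

    /**
     * Moves the shared version from localVersion to localVersion + 1.
     * An empty result means the shared version is not localVersion anymore,
     * ie the caller's journal is not the most up-to-date one.
     */
    OptionalLong age(long localVersion) {
        long next = localVersion + 1;
        return shared.compareAndSet(localVersion, next)
                ? OptionalLong.of(next)
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        CoordinatedJournalVersion version = new CoordinatedJournalVersion();
        System.out.println(version.age(0)); // OptionalLong[1]: the live aged its journal
        System.out.println(version.age(0)); // OptionalLong.empty: a stale broker is rejected
        System.out.println(version.current()); // 1
    }
}
```

Using a compare-and-set instead of a blind increment is a design choice of this 
sketch: a broker holding a stale version fails to age the journal instead of 
silently moving the shared version forward.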

Although tempting, it's not a good idea to use the data field on 
{{InterProcessSemaphoreV2}}, because:
* there's no API to query it if no lease has been acquired (or created) yet
* we need to "age" the journal independently from the lock 
acquisition/release process, eg a live that drops its replica needs to 
increment the journal version

Although tempting, it's not a good idea to just use the last alive broker 
connector identity instead of a journal version, because of the ABA problem 
(see https://en.wikipedia.org/wiki/ABA_problem).
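
A small sketch of why the identity alone is not enough: if the live role goes 
A, then B, then A again, an observer that remembered "A" sees "A" again and 
cannot tell that the journal changed in between, while a version counter 
exposes the change. {{AbaDemo}} and {{LiveRecord}} are hypothetical names, and 
the broker identities are plain strings chosen for illustration.

```java
// Hypothetical sketch: illustrates the ABA problem, not an Artemis API.
final class AbaDemo {

    // What the quorum service would remember about the live role.
    static final class LiveRecord {
        private String lastLive = "";
        private long version;

        void becomeLive(String brokerId) {
            lastLive = brokerId;
            version++; // every change of live bumps the version
        }

        String lastLive() {
            return lastLive;
        }

        long version() {
            return version;
        }
    }

    static boolean looksUnchangedByIdentity(String seenIdentity, LiveRecord now) {
        return seenIdentity.equals(now.lastLive());
    }

    static boolean looksUnchangedByVersion(long seenVersion, LiveRecord now) {
        return seenVersion == now.version();
    }

    public static void main(String[] args) {
        LiveRecord record = new LiveRecord();
        record.becomeLive("A");

        // An observer takes a snapshot while A is live.
        String seenIdentity = record.lastLive();
        long seenVersion = record.version();

        record.becomeLive("B"); // B serves clients and modifies its journal
        record.becomeLive("A"); // A is live again

        System.out.println(looksUnchangedByIdentity(seenIdentity, record)); // true: change missed
        System.out.println(looksUnchangedByVersion(seenVersion, record));   // false: change detected
    }
}
```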

  was:
Shared-nothing replication can cause journal misalignment despite no 
split-brain events.

Scenario without network partitions/outages:
 # Master/Primary start as live, clients connect to it
 # Backup become an in-sync replica
 # User stop live and backup failover to it
 # Backup serve clients, modifying its journal
 # User stop backup
 # User start master/primary: it become live with a journal misaligned with the 
most up-to-date one ie on the stopped backup

The main cause of this issue is because we allow a single broker to serve 
clients, despite configured with HA, generating the journal misalignment.
 Given that quorum service (classic or pluggable) just take care of mutual 
exclusive presence of broker for the live role (vs a NodeID), without 
considering live role ordering ie last live alive: there's the need of a 
distributed agreement on such (total) order.

A possible solution is to leverage on 
https://issues.apache.org/jira/browse/ARTEMIS-2716 and store a "logical 
timestamp" that mark the age of the journal in order to allow the one with the 
most up-to-date one to become a proper live.

It means that in case of quorum service restart/outage, admin must use 
command/configuration to let a broker to ignore the age of its journal and just 
force it to start.
 In addition must be exposed some new journal CLI commands to inspect the age 
of a broker journal, for troubleshooting reasons.

It's very important to capture every possible event that cause the journal age 
to increase
 eg
 # live broker send its journal file to a not yet in sync replica backup, along 
with its "journal age"
 # backup is now ready to failover in any moment
 # a network partition happen
 # backup try to become live for vote-retries times
 # live detect replication disconnection but is "lucky" that can reach the 
quorum and continue serving clients
 # live increment the age of its journal
 # an outage cause live to die
 # network partition is restored
 # backup detect that journal age is no longer matching its own journal: it 
stop trying to become live

The key parts related to journal age/version are:
 * only who's live can change journal version (with a monotonic increment)
 * every breaking point event must cause journal age/version to change eg 
starting as live, loosing its backup, etc etc

 

Re the RI implementation using Apache Curator, this could use a separate 
[DistributedAtomicLong|https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/atomic/DistributedAtomicLong.html]
  to manage the journal version.

Although tempting, it's not a good idea to use the data field on 
{{InterProcessSemaphoreV2}}, because:
* there's no API to query it if no lease is acquired yet (or created)
* we more need to "age" the journal independently from the lock 
acquisition/release process eg a live that drop its replica need to increment 
the journal version

Athough tempting, it's not a good idea to just use the last alive broker 
connector identity instead of a journal version, because of the ABA problem 
(see https://en.wikipedia.org/wiki/ABA_problem).


> Replicated Journal quorum-based logical timestamp
> -------------------------------------------------
>
>                 Key: ARTEMIS-3340
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3340
>             Project: ActiveMQ Artemis
>          Issue Type: Improvement
>            Reporter: Francesco Nigro
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
