GitHub user jbertram opened a pull request:
https://github.com/apache/activemq-artemis/pull/204
ARTEMIS-256 orchestrate failback deterministically
The failback process needs to be deterministic rather than relying on
various
incarnations of Thread.sleep() at crucial points. Important aspects of this
change include:
1) Make the initial replication synchronization process block at the very
last step and wait for a response from the replica to ensure the replica has
as the necessary data. This is a critical piece of knowledge during the
failback process because it allows the soon-to-become-backup server to know
for sure when it can shut itself down and allow the soon-to-become-live
server to take over. Also, introduce a new configuration element called
"initial-replication-sync-timeout" to conrol how long this blocking will
occur.
2) Set the state of the server as 'LIVE' only after the server is fully
started. This is necessary because once the soon-to-be-backup server shuts
down it needs to know that the soon-to-be-live server has started fully
before
it restarts itself as the new backup. If the soon-to-be-backup server
restarts
before the soon-to-be-live is fully started then it won't actually become a
backup server but instead will become a live server which will break the
failback process.
3) Wait to receive the announcement of a backup server before failing-back.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jbertram/activemq-artemis ARTEMIS-256
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/activemq-artemis/pull/204.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #204
----
commit 908776eff3e5410b851aaaa1f62f7db764187acc
Author: jbertram <[email protected]>
Date: 2015-10-14T17:07:17Z
ARTEMIS-256 orchestrate failback deterministically
The failback process needs to be deterministic rather than relying on
various
incarnations of Thread.sleep() at crucial points. Important aspects of this
change include:
1) Make the initial replication synchronization process block at the very
last step and wait for a response from the replica to ensure the replica has
as the necessary data. This is a critical piece of knowledge during the
failback process because it allows the soon-to-become-backup server to know
for sure when it can shut itself down and allow the soon-to-become-live
server to take over. Also, introduce a new configuration element called
"initial-replication-sync-timeout" to conrol how long this blocking will
occur.
2) Set the state of the server as 'LIVE' only after the server is fully
started. This is necessary because once the soon-to-be-backup server shuts
down it needs to know that the soon-to-be-live server has started fully
before
it restarts itself as the new backup. If the soon-to-be-backup server
restarts
before the soon-to-be-live is fully started then it won't actually become a
backup server but instead will become a live server which will break the
failback process.
3) Wait to receive the announcement of a backup server before failing-back.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---