[jira] [Comment Edited] (ARTEMIS-2930) Artemis HA with Replication strategy, has always issue of data loss

Justin Bertram (Jira) Mon, 12 Oct 2020 20:17:40 -0700


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212806#comment-17212806
 ]


Justin Bertram edited comment on ARTEMIS-2930 at 10/13/20, 3:16 AM:
--------------------------------------------------------------------

The first thing to note is that you're reading documentation from the 1.0.0 
release. You can see the version in the URL, i.e. 
[https://activemq.apache.org/components/artemis/documentation/*1.0.0*/ha.html|https://activemq.apache.org/components/artemis/documentation/1.0.0/ha.html].
 I assume you're not actually using the 1.0.0 release. You should be reading 
the documentation that corresponds to the version you're using. You can always 
find the latest documentation at 
[https://activemq.apache.org/components/artemis/documentation/*latest*/|http://activemq.apache.org/components/artemis/documentation/latest/].

Next, you quote this bit of documentation:

bq. Replication will create a copy of the data at the backup. One issue to be 
aware of is: in case of a successful fail-over, the backup's data will be newer 
than the one at the live's storage. If you configure your live server to 
perform a failback to live server when restarted, it will synchronize its data 
with the backup's. If both servers are shutdown, the administrator will have to 
determine which one has the latest data.

This is not really talking about "data loss." It is simply drawing a 
distinction between the behavior of shared-store and replication. Hopefully I 
can explain in a way you can understand...

When using shared storage, the live and backup brokers always have direct 
access to the most up-to-date journal because it sits on the shared storage 
device. Therefore, when you restart a live broker after it has failed and the 
backup has taken over, that live broker can simply initiate fail-back and 
connect to the shared store to get the most up-to-date data.

However, when using replication, the data has to be physically replicated 
between brokers. Therefore, when you restart a live broker after it has failed 
and the backup has taken over, the restarted broker first has to become a 
backup to the now-live server and receive the replicated journal data. Only 
then can it initiate fail-back.
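
For reference, the fail-back sequence described above is driven by the 
{{ha-policy}} section of each broker's {{broker.xml}}. What follows is a 
minimal sketch, assuming a 2.x broker where these element names apply. Check 
them against the documentation for the version you're actually running.

```xml
<!-- broker.xml on the original live broker (sketch, not a complete file) -->
<ha-policy>
   <replication>
      <master>
         <!-- On startup, look for a server that is already live with this
              broker's node ID, so this broker can sync from it first. -->
         <check-for-live-server>true</check-for-live-server>
      </master>
   </replication>
</ha-policy>

<!-- broker.xml on the original backup broker (sketch, not a complete file) -->
<ha-policy>
   <replication>
      <slave>
         <!-- Let this broker hand control back once the original live broker
              has restarted and finished synchronizing. -->
         <allow-failback>true</allow-failback>
      </slave>
   </replication>
</ha-policy>
```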

If a live broker fails, the backup takes over, and then the backup also fails 
before the original live broker is restarted, an administrator will have to 
inspect both brokers' log files to determine which broker was alive most 
recently, because that's the broker that will have the most up-to-date data. 
The broker with the most up-to-date data needs to be started _first_ so that 
it can become live and serve clients with the data they expect. If the broker 
with stale data is started first and becomes live, and the other broker then 
starts and becomes its backup, the stale data will be replicated to the 
backup, but (and here's a very important bit) the backup's original up-to-date 
data *will not be lost*. It will be moved into a special backup directory. This 
is controlled by the {{max-saved-replicated-journals-size}} configuration 
property discussed in the documentation.
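
Here is a sketch of where that property goes, assuming a 2.x {{broker.xml}} 
and that the element belongs to the backup's replication policy. The value 
shown is only illustrative. Check the HA documentation for your version for 
the default and the exact semantics.

```xml
<ha-policy>
   <replication>
      <slave>
         <!-- Limits how many saved (moved-aside) copies of the journal are
              kept on this broker. The value here is illustrative. -->
         <max-saved-replicated-journals-size>2</max-saved-replicated-journals-size>
      </slave>
   </replication>
</ha-policy>
```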

As far as my previous explanation goes, that specific text is not in the 
documentation although the general idea is. The whole point of "high 
availability" in general and replication in particular is to *not lose 
messages*. The documentation doesn't really dive into implementation details 
because those details are subject to change even when the actual function 
remains the same. Ultimately if you want confidence about how HA works you 
should inspect the code-base to see how it works and then run experiments to 
ensure it behaves the way you expect for the use-cases you care about.


> Artemis HA with Replication strategy, has always issue of data loss 
> --------------------------------------------------------------------
>
>                 Key: ARTEMIS-2930
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2930
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>            Reporter: Karan Aggarwal
>            Priority: Major
>
> In the documentation I read that the HA with replication strategy, the slave 
> node keeps polling the new data at a regular interval.
> So, there is 100% chance that delta messages are lost if the master server is 
> down.
>  
> How to overcome this issue and ensure that there is no data loss in any 
> condition while using HA replication strategy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
