[ 
https://issues.apache.org/jira/browse/ARTEMIS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723419#comment-17723419
 ] 

Liviu Citu edited comment on ARTEMIS-4276 at 5/17/23 11:39 AM:
---------------------------------------------------------------

The question is really about dealing with message duplication when grouping is 
used during a failover switch.

Let me provide some more details to help clarify the business case.

Suppose we have a gateway that interfaces with an external system to import 
transactions into the database. This interface consists of two main components:
 * *gateway adapter server (producer)*: *receives* messages from the 
external systems using some APIs and *puts* them on a specific JMS topic
 * *gateway loader server (consumer)*: *consumes* messages from the adapter 
JMS topic, does some processing and *saves* the transaction into the database

As the processing is time consuming and the message volume is very high, we 
have to *balance the gateway loader server*: two or more loader 
servers/consumers can be configured to listen to the same topic. We are using 
*virtual topics* for that.
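
For illustration only, here is a minimal sketch of how one of the balanced 
loaders could consume from a virtual topic, assuming the classic ActiveMQ 
virtual-topic naming convention and an OpenWire JMS client; the broker URL, 
queue name and class names are hypothetical, not our actual setup:

{code:java}
// Illustrative sketch only: assumes the ActiveMQ virtual-topic convention, where
// the adapter publishes to the topic VirtualTopic.Transactions and every loader
// instance consumes from the backing queue Consumer.Loaders.VirtualTopic.Transactions,
// so the broker load-balances messages between LDR1 and LDR2.
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class LoaderConsumer {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://broker:61616"); // hypothetical URL
        Connection connection = cf.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        Queue backingQueue = session.createQueue("Consumer.Loaders.VirtualTopic.Transactions");
        MessageConsumer consumer = session.createConsumer(backingQueue);

        consumer.setMessageListener(message -> {
            try {
                // processing and database import would happen here
                message.acknowledge();
            } catch (JMSException e) {
                // no acknowledgement -> the broker will redeliver the message
            }
        });
    }
}
{code}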

These external transactions have versioning, so we need to ensure that they are 
processed in a specific order (actually, in the order they are received). To 
ensure that, we use *JMSXGroupID*, which identifies the transaction 
without its version. By using grouping we ensure that the same consumer will 
process all versions of the same transaction.

Assume an external transaction is identified by 
*ExternalSystem+ExternalType+ExternalID*. The gateway adapter will set 
*JMSXGroupID* to this value in the JMS message before sending it to the topic. 
If a new version of the same transaction is received from the external system, 
the same *JMSXGroupID* will be set in the message.

Practical example:

*EXT_SWAP_ID* with version *1* will have *JMSXGroupID=EXT_SWAP_ID*

*EXT_SWAP_ID* with version *2* will have *JMSXGroupID=EXT_SWAP_ID*

*EXT_BOND_ID* with version *1* will have *JMSXGroupID=EXT_BOND_ID*

*EXT_BOND_ID* with version *2* will have *JMSXGroupID=EXT_BOND_ID*

*EXT_BOND_ID* with version *3* will have *JMSXGroupID=EXT_BOND_ID*
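
To make the above concrete, here is a minimal, hedged sketch of what the 
adapter (producer) side might look like when stamping the group id; the method 
name, the "version" property and the way the session/producer are obtained are 
illustrative assumptions, not our actual code:

{code:java}
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;

// Sketch of the adapter side: the group id is the transaction key without its
// version, so e.g. every version of EXT_SWAP_ID carries the same JMSXGroupID
// and is therefore routed to the same loader.
public class AdapterPublisher {

    public void publishTransaction(Session session, MessageProducer producer,
                                   String externalSystem, String externalType,
                                   String externalId, int version, String payload)
            throws JMSException {
        TextMessage message = session.createTextMessage(payload);
        // JMSXGroupID = ExternalSystem+ExternalType+ExternalID, version deliberately excluded
        message.setStringProperty("JMSXGroupID",
                externalSystem + "+" + externalType + "+" + externalId);
        message.setIntProperty("version", version); // illustrative property name
        producer.send(message);
    }
}
{code}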

Let's assume we have two loaders (consumers): *LDR1* and *LDR2*.

Prior to the failover we know that:

*LDR1* has processed all messages having *JMSXGroupID=EXT_SWAP_ID*

*LDR2* has processed all messages having *JMSXGroupID=EXT_BOND_ID*

Just before the failover switch, we received two new transactions:

*EXT_SWAP_ID* with version *3* (*JMSXGroupID=EXT_SWAP_ID*)

*EXT_BOND_ID* with version *4* (*JMSXGroupID=EXT_BOND_ID*)

*LDR1* and *LDR2* were able to process these transactions, meaning:

*LDR1* has processed *EXT_SWAP_ID* with version *3*

*LDR2* has processed *EXT_BOND_ID* with version *4*

However, *the broker was not able to receive the message acknowledgements due to 
the network interruption (failover switch).* After the broker is back online, it 
sends the two messages to its consumers again.

To handle message duplication, all our consumers' listeners use an *LRU* 
(least recently used) cache of the already processed messages, so if the same 
message is received twice it will be skipped (a sketch of this cache follows 
the two cases below). Therefore:

if *LDR1* receives *EXT_SWAP_ID* with version *3* again, it will skip it.

if *LDR2* receives *EXT_BOND_ID* with version *4* again, it will skip it.
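
For reference, a minimal sketch of the kind of per-listener LRU check described 
above (cache size, class name and key format are illustrative assumptions):

{code:java}
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the per-listener duplicate check: a bounded, access-ordered
// LinkedHashMap acts as an LRU set of already-processed message keys.
// The set lives in the JVM of a single loader, which is why LDR1 and LDR2
// cannot see each other's entries.
public class ProcessedMessageCache {

    private static final int MAX_ENTRIES = 10_000; // illustrative size

    private final Set<String> processed = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > MAX_ENTRIES; // evict the least recently used key
                }
            });

    /** Returns true the first time a key is seen, false for a duplicate. */
    public synchronized boolean markProcessed(String messageKey) {
        return processed.add(messageKey);
    }
}
{code}

With a key such as "EXT_SWAP_ID#3" (group id plus version), a redelivery to the 
same loader is skipped, but the same message redelivered to the other loader 
still looks new, which is exactly the problem described below.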

However, the problem is that after the failover switch the transactions are 
received the other way around:

*LDR1* received *EXT_BOND_ID* with version *4* 

*LDR2* received *EXT_SWAP_ID* with version *3*

*LDR1* and *LDR2* are separate processes, so they cannot share their *LRU* cache, 
as it is an in-memory cache. Therefore these messages are considered new to the 
loaders because they are not part of their LRU caches, and hence the loaders 
will try to process these transactions.

This leads to the same transaction being imported into the database twice, 
which causes several other issues in our application. In fact, the re-import of 
these transactions might fail entirely and in some cases will cause both *LDR1* 
and *LDR2* to malfunction.

Is there any setup to circumvent this? Is the grouping cache used by the 
broker distributed or persisted during the failover switch?



> Message Group does not replicate properly during failover
> ---------------------------------------------------------
>
>                 Key: ARTEMIS-4276
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4276
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.28.0
>            Reporter: Liviu Citu
>            Priority: Major
>
> Hi,
> We are currently migrating our software from Classic to Artemis and we plan 
> to use failover functionality.
> We were using message group functionality by setting *JMSXGroupID* and this 
> was working as expected. However, after a failover switch I noticed that 
> messages are sent to the wrong consumers.
> Our gateway/interface application is actually a collection of servers:
>  * gateway adapter server: receives messages from external systems and 
> puts them on a specific/virtual topic
>  * gateway loader server (can be balanced): picks up the messages from the 
> topic and does processing
>  * gateway fail queue: monitors all messages that failed processing and 
> provides functionality for resubmitting the message (users will correct the 
> processing errors and then resubmit the transaction)
> *JMSXGroupID* is used to ensure that during a message resubmit the message is 
> processed by the same consumer/loader that originally processed it.
> However, if the message resubmit happens during a failover switch, we have 
> noticed that the message is not sent to the right consumer as it should be. 
> Basically, the first available consumer is used, which is not what we want.
> I have searched for configuration changes but couldn't find any relevant 
> information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
