[jira] [Commented] (ARTEMIS-4276) Message Group does not replicate properly during failover

Liviu Citu (Jira) Wed, 17 May 2023 04:23:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723419#comment-17723419
 ]


Liviu Citu commented on ARTEMIS-4276:
-------------------------------------

My question is really about how to handle message duplication when message 
grouping is in use.

Let me provide some more details to explain the business case.

Our software uses an ActiveMQ JMS broker to distribute messages across all of 
its clients and servers. We have gateway interfaces to external systems that 
import transactions into the database. Each such interface consists of two 
main components:
 * *gateway adapter server (producer)*: receives messages from the external 
systems through their APIs and *puts* them on a specific JMS topic
 * *gateway loader server (consumer)*: consumes messages from the adapter's 
JMS topic, processes them and saves the transactions into the database

Because the processing is time consuming and message volumes are very high, 
we have to *balance the gateway loader server*: two or more loader servers 
(consumers) can be configured to listen to the same producer. We can have 
multiple consumers of the same topic by using *virtual topics*.

These external transactions are versioned, so we need to ensure that they are 
processed in a specific order (the order in which they are received). To 
ensure that, we use *JMSXGroupID*, which identifies the transaction without 
its version. Grouping ensures that the same consumer processes all versions 
of the same transaction.

An external transaction is identified by 
*ExternalSystem+ExternalType+ExternalID*. The gateway adapter sets 
*JMSXGroupID* to this value on the JMS message before sending it to the 
topic. If a new version of the same transaction is received from the external 
system, the same *JMSXGroupID* is set on the message.

Practical example:

*EXT_SWAP_ID1* with version 1 will have *JMSXGroupID=EXT_SWAP_ID1*

*EXT_SWAP_ID1* with version 2 will have *JMSXGroupID=EXT_SWAP_ID1*

*EXT_BOND_ID1* with version 1 will have *JMSXGroupID=EXT_BOND_ID1*

*EXT_BOND_ID1* with version 2 will have *JMSXGroupID=EXT_BOND_ID1*

*EXT_BOND_ID1* with version 3 will have *JMSXGroupID=EXT_BOND_ID1*
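
As a minimal sketch of the mapping above: the group key is built from the 
transaction identity only, so every version of a transaction gets the same 
value. The class and method names are hypothetical, and the underscore 
separator is an assumption inferred from the sample IDs; the real adapter may 
build the key differently.

```java
// Hypothetical helper; not taken from the actual gateway adapter.
final class GroupIds {
    private GroupIds() {}

    /**
     * Builds the group key from the transaction identity.
     * The version is deliberately left out so that all versions
     * of one transaction share the same group.
     */
    static String groupId(String externalSystem, String externalType, String externalId) {
        // Assumption: parts are joined with '_' as in EXT_SWAP_ID1.
        return externalSystem + "_" + externalType + "_" + externalId;
    }
}
```

The adapter would then put this value on the outgoing message, for example 
with the standard JMS call message.setStringProperty("JMSXGroupID", groupId).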

Let's assume we have two loaders (consumers): *LDR1* and *LDR2* .

Prior to failover we know that:

*LDR1* has processed all messages having *JMSXGroupID=EXT_SWAP_ID1*

*LDR2* has processed all messages having *JMSXGroupID=EXT_BOND_ID1*

During the failover switch we received two transactions:

*EXT_SWAP_ID1* with version 3 (*JMSXGroupID=EXT_SWAP_ID1*)

*EXT_BOND_ID1* with version 4 (*JMSXGroupID=EXT_BOND_ID1*)

*LDR1* and *LDR2* were able to process the transactions, meaning:

*LDR1* has processed *EXT_SWAP_ID1* with version 3

*LDR2* has processed *EXT_BOND_ID1* with version 4

However, when they sent their acknowledgements, the broker could not receive 
them because of the network interruption (failover switch). Once the broker 
was back online, it sent the two messages to its consumers again.

To handle message duplication, all our consumer listeners keep an LRU (least 
recently used) cache of already processed messages. If the same message is 
received again, it is skipped. Therefore:

if *LDR1* receives *EXT_SWAP_ID1* with version 3 again, it will skip it.

if *LDR2* receives *EXT_BOND_ID1* with version 4 again, it will skip it.
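
A per-consumer cache of this kind could look roughly like the sketch below. 
It is an illustration only: the class name, the method name and the 
"transaction#version" key format are assumptions, not our actual listener 
code. It relies on java.util.LinkedHashMap in access order, which evicts the 
least recently used entry once the capacity is exceeded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-consumer duplicate filter backed by an LRU map.
final class ProcessedMessageCache {
    private final Map<String, Boolean> seen;

    ProcessedMessageCache(final int capacity) {
        // accessOrder=true: iteration order is least recently used first,
        // so the "eldest" entry is the LRU one.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time a key is seen, false for a duplicate. */
    synchronized boolean markProcessed(String transactionId, int version) {
        String key = transactionId + "#" + version; // assumed key format
        if (seen.get(key) != null) {
            return false; // duplicate; get() also refreshed its recency
        }
        seen.put(key, Boolean.TRUE);
        return true;
    }
}
```

Note that such a cache lives inside one consumer process, so it can only 
recognise messages that this particular consumer has already handled.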

However, the problem is that after the failover switch:

*LDR1* received *EXT_BOND_ID1* with version 4 

*LDR2* received *EXT_SWAP_ID1* with version 3

These messages are new to them because they are not in their LRU caches, so 
they try to process the transactions. This leads to the same transaction 
being imported into the database twice, which causes problems from a 
financial point of view. In practice the re-import may fail, and in some 
cases it causes both *LDR1* and *LDR2* to stop processing.
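
The scenario can be reproduced in a small simulation. This is only a sketch 
under stated assumptions: the Loader and FailoverScenario classes are 
hypothetical, a plain set stands in for each loader's LRU cache, and the 
redelivered messages are assumed to reach the opposite loader when the group 
assignment is lost.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a gateway loader with its own local cache.
final class Loader {
    private final Set<String> processed = new HashSet<>();
    int imports = 0;

    /** Returns true if the message was imported, false if skipped as a duplicate. */
    boolean onMessage(String groupId, int version) {
        if (!processed.add(groupId + "#" + version)) {
            return false; // already in this loader's cache
        }
        imports++; // a database import would happen here
        return true;
    }
}

final class FailoverScenario {
    /** Returns the total number of imports performed by both loaders. */
    static int run(boolean groupsSwapAfterFailover) {
        Loader ldr1 = new Loader();
        Loader ldr2 = new Loader();

        // Processed during the failover switch; the acknowledgements are lost.
        ldr1.onMessage("EXT_SWAP_ID1", 3);
        ldr2.onMessage("EXT_BOND_ID1", 4);

        // The broker redelivers both messages once it is back online.
        Loader swapTarget = groupsSwapAfterFailover ? ldr2 : ldr1;
        Loader bondTarget = groupsSwapAfterFailover ? ldr1 : ldr2;
        swapTarget.onMessage("EXT_SWAP_ID1", 3);
        bondTarget.onMessage("EXT_BOND_ID1", 4);

        return ldr1.imports + ldr2.imports;
    }
}
```

With the group assignment preserved, FailoverScenario.run(false) yields 2 
imports because both redeliveries are skipped. With the assignment swapped, 
FailoverScenario.run(true) yields 4, which is the double import described 
above.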

Is there any configuration that prevents this?

> Message Group does not replicate properly during failover
> ---------------------------------------------------------
>
>                 Key: ARTEMIS-4276
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4276
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.28.0
>            Reporter: Liviu Citu
>            Priority: Major
>
> Hi,
> We are currently migrating our software from Classic to Artemis and we plan 
> to use failover functionality.
> We were using message group functionality by setting *JMSXGroupID* and this 
> was working as expected. However after failover switch I noticed that 
> messages are sent to wrong consumers.
> Our gateway/interface application is actually a collection of servers:
>  * gateway adapter server: receives messages from external systems and 
> puts them on a specific/virtual topic
>  * gateway loader server (can be balanced): picks up the messages from the 
> topic and processes them
>  * gateway fail queue: monitors all messages that failed processing and has a 
> functionality of resubmitting the message (users will correct the processing 
> errors and then resubmit transaction)
> *JMSXGroupID* is used to ensure that during message resubmit the same 
> consumer/loader is processing the message as it was originally processed.
> However, if the message resubmit is happening during failover switch we have 
> noticed that the message is not sent to the right consumer as it should. 
> Basically the first available consumer is used which is not what we want.
> I have searched for configuration changes but couldn't find any relevant 
> information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
