Ephemeris Lappis created AMQ-9653:
-------------------------------------
Summary: ActiveMQ doesn't process and deliver messages into/from
queues in the right order
Key: AMQ-9653
URL: https://issues.apache.org/jira/browse/AMQ-9653
Project: ActiveMQ Classic
Issue Type: Bug
Components: Broker
Affects Versions: 5.18.3
Environment: ActiveMQ runs as a pod in Kubernetes clusters,
on an OpenJDK 64-Bit Server VM, version 17.0.9+9 (Eclipse Adoptium).
The persistence DB is on a VM (not a container).
Karaf has the same configuration in the same cluster namespace.
Everything runs on Linux images (kernel 5.15.0-116-generic #126-Ubuntu).
Reporter: Ephemeris Lappis
After some load tests we can see random cases of messages being delivered in
an unexpected order. In our case this breaks the application, for which the
processing order is mandatory.
FYI, we use JDBC (PostgreSQL) persistence, although it may not be directly
involved.
We've put a few hacks in place to try to analyze where and when this occurs.
The first is a trigger on the "activemq_msgs" table that inserts all messages
into a mirror table to keep them. Then we substituted a small Camel route for
the application's queue consumer; it consumes the messages and checks their
logical order. When a message arrives out of sequence, a log is written in
Karaf, and the message is copied into another backup queue. We also have the
logs of the Camel routes that produce the messages, which give the original
chronology.
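For reference, the core of our out-of-order check boils down to the logic
below (a minimal, self-contained sketch; the class and method names are
illustrative, not the actual Camel route, which also copies offending
messages to the backup queue):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the out-of-order detection used by the test consumer.
 * Tracks the last sequence number seen per conversation (keyed by ie_id)
 * and flags any message whose sequence is lower than one already seen.
 */
public class OrderChecker {
    private final Map<String, Long> lastSeq = new HashMap<>();

    /** Returns true if the message is in order for its conversation. */
    public boolean accept(String ieId, long sequence) {
        Long previous = lastSeq.get(ieId);
        if (previous != null && sequence < previous) {
            return false; // out of order: would be logged and backed up
        }
        lastSeq.put(ieId, sequence);
        return true;
    }

    public static void main(String[] args) {
        OrderChecker checker = new OrderChecker();
        System.out.println(checker.accept("batch-1", 1)); // true
        System.out.println(checker.accept("batch-1", 2)); // true
        System.out.println(checker.accept("batch-1", 4)); // true
        System.out.println(checker.accept("batch-1", 3)); // false
    }
}
```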
Here is an example where 5 messages constitute the "conversation". The table
below (placed at the end of this text) is built from the message mirror table.
Its columns are:
- "id": the primary key generated by ActiveMQ; it gives a first idea of
message ordering.
- "ie_id": our business identifier for a given batch of messages.
- "correlation_id": as its name says, an ID that may correlate different
conversations, and that is shared across the whole application system.
- "etape": the business step; I'll give more details about it below.
- "date_traitement": the current date, set by the producer just before the
message is sent.
- "jms_timestamp_ms" and "jms_timestamp": the JMSTimestamp of the message.
- "commit_time": the PostgreSQL transaction time of the inserted row.
The logical order of our steps in almost all conversations is:
- 1 "initier_echange"
- 1 "initier_utr" for each sub-batch (logical transactional unit, "TU")
- 0 or more "diffuser_donnees_utr" for each TU (one for every piece of
generated data)
- 1 "diffuser_utr" that closes the sub-batch
Our most common use cases involve only one TU with 1 or 2 pieces of data.
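To make the expected sequence concrete, here is a small validator sketch for
that step order (the "etape" names are the real ones; the class itself and
its shape are purely illustrative):

```java
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of a validator for the expected step order of one conversation:
 * "initier_echange" first, then per TU: one "initier_utr", zero or more
 * "diffuser_donnees_utr", and a closing "diffuser_utr".
 */
public class ConversationValidator {

    /** Returns true if the list of "etape" values follows the expected order. */
    public static boolean isValid(List<String> etapes) {
        if (etapes.isEmpty() || !etapes.get(0).equals("initier_echange")) {
            return false;
        }
        int i = 1;
        while (i < etapes.size()) {
            if (!etapes.get(i).equals("initier_utr")) {
                return false; // each TU must start with initier_utr
            }
            i++;
            while (i < etapes.size() && etapes.get(i).equals("diffuser_donnees_utr")) {
                i++; // zero or more data messages
            }
            if (i >= etapes.size() || !etapes.get(i).equals("diffuser_utr")) {
                return false; // the TU must be closed by diffuser_utr
            }
            i++;
        }
        return true;
    }

    public static void main(String[] args) {
        // The faulty conversation from the table below:
        // a diffuser_donnees_utr delivered after the closing diffuser_utr.
        List<String> observed = Arrays.asList(
                "initier_echange", "initier_utr", "diffuser_donnees_utr",
                "diffuser_utr", "diffuser_donnees_utr");
        System.out.println(isValid(observed)); // false
    }
}
```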
According to the table, the "diffuser_utr" has been processed by ActiveMQ
before the last "diffuser_donnees_utr". That last message seems to have been
temporarily lost for a while: it was persisted, and then delivered a very long
time after it was sent. Note that its "date_traitement" is correct, but both
the row "id" and the commit time do not match. The JMS timestamps also look
inconsistent: first when compared with the "date_traitement", and then because
the 4 last messages share the same value, even though the last one was
inserted and dispatched after quite a delay. Finally, that last message was
received by the consumer at 2025-01-23T15:45:39,236, hence the error that was
raised.
This is just one example. In some other cases the very first message is
delivered after all the others.
On the broker side there is no log of any strange behavior, and the system and
JVM monitoring do not show any lack of resources.
FYI, our load tests are run with a JMeter injector that sets up a group of 20
threads, each sending messages to 16 queues in loops of 10 to 100 iterations.
The 16 queues are processed by 16 Camel routes that indirectly produce 4 or 5
messages into the "buggy" queue for every received input. In the application,
a single component with a single consumer processes them; the same goes for
our testing Camel consumer.
As the broker is a kind of black box for us, we can't give more information
about how it handles this kind of load test, which is more or less what could
happen in a real rush situation.
Of course I can provide you with any detail you might ask for. I've not found
any interesting match while googling for similar issues, but perhaps someone
has already hit this.
Thanks in advance.
||id||ie_id||correlation_id||etape||date_traitement||jms_timestamp_ms||jms_timestamp||commit_time||
|8 124|5009d4f3-dfac-4de2-8360-32999ee6aac7|6b371cdc-7071-4dfa-8c58-231770df4f8e|initier_echange|2025-01-23T14:45:15.963Z|1737643516108|2025-01-23 15:45:16.108 +0100|2025-01-23 15:45:15.964 +0100|
|8 142|5009d4f3-dfac-4de2-8360-32999ee6aac7|6b371cdc-7071-4dfa-8c58-231770df4f8e|initier_utr|2025-01-23T14:45:16.003Z|1737643515969|2025-01-23 15:45:15.969 +0100|2025-01-23 15:45:16.005 +0100|
|8 165|5009d4f3-dfac-4de2-8360-32999ee6aac7|6b371cdc-7071-4dfa-8c58-231770df4f8e|diffuser_donnees_utr|2025-01-23T14:45:16.062Z|1737643515969|2025-01-23 15:45:15.969 +0100|2025-01-23 15:45:16.064 +0100|
|8 166|5009d4f3-dfac-4de2-8360-32999ee6aac7|6b371cdc-7071-4dfa-8c58-231770df4f8e|diffuser_utr|2025-01-23T14:45:16.067Z|1737643515969|2025-01-23 15:45:15.969 +0100|2025-01-23 15:45:16.069 +0100|
|18 982|5009d4f3-dfac-4de2-8360-32999ee6aac7|6b371cdc-7071-4dfa-8c58-231770df4f8e|diffuser_donnees_utr|2025-01-23T14:45:16.062Z|1737643515969|2025-01-23 15:45:15.969 +0100|2025-01-23 15:45:39.265 +0100|
--
This message was sent by Atlassian Jira
(v8.20.10#820010)