[
https://issues.apache.org/jira/browse/ARTEMIS-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059917#comment-18059917
]
Claudiu Chioasca edited comment on ARTEMIS-5895 at 2/20/26 2:35 PM:
--------------------------------------------------------------------
I just replicated with 2.51.0.
Though I attached the source code of my test already, it's not runnable on your
side because there's a dependency to some internal lib you don't have access to.
I will prepare a vanilla maven project so you can run it on your side.
Here's some excerpt from today's log, it took 101 iterations + some over 2hrs
to happen.
_Connection brokerURL = tcp://localhost:4175_
_|NAME |ADDRESS
|CONSUMER|MESSAGE|MESSAGES|DELIVERING|MESSAGES|SCHEDULED|ROUTING|INTERNAL|_
_| | | COUNT | COUNT | ADDED | COUNT | ACKED |
COUNT | TYPE | |_
_|failover-queue|failover-queue| 0 | 9999 | 9999 | 0 | 0 |
0 |ANYCAST| false |_
_[2026-02-20 15:26:16] [INFO] Queue 'failover-queue' has 9999 messages_
_*[2026-02-20 15:26:16] [INFO] ========== ITERATION 101 RESULTS ==========*_
_[2026-02-20 15:26:16] [INFO] Expected messages: 10000_
_[2026-02-20 15:26:16] [INFO] Client reported sent: 10000_
_[2026-02-20 15:26:16] [INFO] Actual messages in queue: 9999_
_[2026-02-20 15:26:16] [INFO] Kill delay was: 4827 ms (after messages started)_
_[2026-02-20 15:26:16] [INFO] Test failed: True_
_[2026-02-20 15:26:16] [ERROR] !!! BUG DETECTED !!! 1 messages lost (client
sent 10000 but queue has 9999)_
The replication pattern was the same as the one reported already, in this case
message *2804* out of 10k {*}looks as if successfully committed, but it's not
found in the destination queue{*}, and it's the message just before the retried
one, which successfully gets to the destination:
*Sent message Test Message 2804*
2026-02-20T15:25:39.717+02:00 WARN 48116 — [168.100.58:5176]
o.a.a.t.failover.FailoverTransport : Transport (ssl://dev-machine:5176)
failed, attempting to automatically reconnect
java.io.EOFException
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:210)
~[na:na]
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:385) ~[na:na]
at
org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:289)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:241)
~[activemq-client-6.1.8.jar:6.1.8]
at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:233)
~[activemq-client-6.1.8.jar:6.1.8]
at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:216)
~[activemq-client-6.1.8.jar:6.1.8]
at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]
2026-02-20T15:25:41.364+02:00 INFO 48116 — [ActiveMQ Task-2]
o.a.a.t.failover.FailoverTransport : Successfully reconnected to
ssl://dev-machine:4176
2026-02-20T15:25:41.364+02:00 INFO 48116 — [ main]
org.apache.activemq.TransactionContext : commit failed for transaction
TX:ID:dev-machine-55914-1771593929093-1:1:2806
jakarta.jms.TransactionRolledBackException: Transaction completion in doubt due
to failover. Forcing rollback of TX:ID:dev-machine-55914-1771593929093-1:1:2806
at
org.apache.activemq.state.ConnectionStateTracker.restoreTransactions(ConnectionStateTracker.java:274)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.state.ConnectionStateTracker.restore(ConnectionStateTracker.java:211)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.transport.failover.FailoverTransport.restoreTransport(FailoverTransport.java:860)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.transport.failover.FailoverTransport.doReconnect(FailoverTransport.java:1029)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.transport.failover.FailoverTransport$2.iterate(FailoverTransport.java:151)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:133)
~[activemq-client-6.1.8.jar:6.1.8]
at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:48)
~[activemq-client-6.1.8.jar:6.1.8]
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
~[na:na]
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
~[na:na]
at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]
*Rolled back message Test Message 2805*
*Sent message Test Message 2805*
was (Author: k100000):
I just replicated with 2.51.0.
Though I attached the source code of my test already, it's not runnable on your
side because there's a dependency to some internal lib you don't have access to.
I will prepare a vanilla maven project so you can run it on your side.
Here's some excerpt from today's log, it took 101 iterations + some over 2hrs
to happen.
_Connection brokerURL = tcp://localhost:4175_
_|NAME |ADDRESS
|CONSUMER|MESSAGE|MESSAGES|DELIVERING|MESSAGES|SCHEDULED|ROUTING|INTERNAL|_
_| | | COUNT | COUNT | ADDED | COUNT | ACKED |
COUNT | TYPE | |_
_|failover-queue|failover-queue| 0 | 9999 | 9999 | 0 | 0 |
0 |ANYCAST| false |_
_[2026-02-20 15:26:16] [INFO] Queue 'failover-queue' has 9999 messages_
_*[2026-02-20 15:26:16] [INFO] ========== ITERATION 101 RESULTS ==========*_
_[2026-02-20 15:26:16] [INFO] Expected messages: 10000_
_[2026-02-20 15:26:16] [INFO] Client reported sent: 10000_
_[2026-02-20 15:26:16] [INFO] Actual messages in queue: 9999_
_[2026-02-20 15:26:16] [INFO] Kill delay was: 4827 ms (after messages started)_
_[2026-02-20 15:26:16] [INFO] Test failed: True_
_[2026-02-20 15:26:16] [ERROR] !!! BUG DETECTED !!! 1 messages lost (client
sent 10000 but queue has 9999)_
The replication pattern was the same as the one reported already, in this case
message *2804* out of 10k {*}looks as if successfully committed, but it's not
found in the destination queue{*}, and it's the message just before the retried
one, which successfully gets to the destination:
*Sent message Test Message 2804*
2026-02-20T15:25:39.717+02:00 WARN 48116 --- [168.100.58:5176]
o.a.a.t.failover.FailoverTransport : Transport (ssl://dev-machine:5176)
failed, attempting to automatically reconnect
java.io.EOFException
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:210)
~[na:na]
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:385) ~[na:na]
at
org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:289)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:241)
~[activemq-client-6.1.8.jar:6.1.8]
at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:233)
~[activemq-client-6.1.8.jar:6.1.8]
at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:216)
~[activemq-client-6.1.8.jar:6.1.8]
at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]
2026-02-20T15:25:41.364+02:00 INFO 48116 --- [ActiveMQ Task-2]
o.a.a.t.failover.FailoverTransport : Successfully reconnected to
ssl://dev-machine:4176
2026-02-20T15:25:41.364+02:00 INFO 48116 --- [ main]
org.apache.activemq.TransactionContext : commit failed for transaction
TX:ID:dev-machine-55914-1771593929093-1:1:2806
jakarta.jms.TransactionRolledBackException: Transaction completion in doubt due
to failover. Forcing rollback of TX:ID:BUCLFZ9JY04-55914-1771593929093-1:1:2806
at
org.apache.activemq.state.ConnectionStateTracker.restoreTransactions(ConnectionStateTracker.java:274)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.state.ConnectionStateTracker.restore(ConnectionStateTracker.java:211)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.transport.failover.FailoverTransport.restoreTransport(FailoverTransport.java:860)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.transport.failover.FailoverTransport.doReconnect(FailoverTransport.java:1029)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.transport.failover.FailoverTransport$2.iterate(FailoverTransport.java:151)
~[activemq-client-6.1.8.jar:6.1.8]
at
org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:133)
~[activemq-client-6.1.8.jar:6.1.8]
at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:48)
~[activemq-client-6.1.8.jar:6.1.8]
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
~[na:na]
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
~[na:na]
at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]
*Rolled back message Test Message 2805*
*Sent message Test Message 2805*
> Message loss during failover switch in shared store configuration
> -----------------------------------------------------------------
>
> Key: ARTEMIS-5895
> URL: https://issues.apache.org/jira/browse/ARTEMIS-5895
> Project: Artemis
> Issue Type: Bug
> Components: OpenWire
> Affects Versions: 2.44.0
> Reporter: Claudiu Chioasca
> Assignee: Clebert Suconic
> Priority: Critical
> Attachments: 2372.png, 2373_missing.png, 2374.png,
> FailoverApplicationTests.java, QueueSender.java, artemis.log,
> failover-queue.png, failover-test-automation.ps1,
> producer-bug-detected-iteration-82.log
>
>
> Sometimes, a message producer connected via OPENWIRE protocol and calling
> commit() over a transacted session is not signaled with an exception when
> failover switch happens and the commit fails.
>
> My test consists of:
>
> - primary/backup artemis instances deployed with shared store configuration
> (2.44.0)
>
> - a JDK21 spring boot (4.0.1) based producer:
>
> <dependency>
> <groupId>org.springframework.boot</groupId>
> <artifactId>spring-boot-starter-activemq</artifactId>
> </dependency>
>
> that connects to the broker via failover url:
> failover:(ssl://LOCAL-DEV:5176,ssl://LOCAL-DEV:4176)
>
> - this scenario: while both primary & backup are up, producer starts sending
> 10000 messages to "failover-queue" destination, during this time the primary
> instance is shut down using "artemis stop". The producer is configured to
> retry when session.commit() fails
>
> - a script to repeat the same sequence of steps until message loss is
> detected: restart brokers, purge test destination, execute spring boot test,
> shut down primary when messages start to appear in test destination, count
> the messages when the test finishes
>
> I let the script running for a couple of hours until it replicated,
> producer-bug-detected-iteration-82.log shows the output of the producer +
> script detecting the loss.
> I attached the primary instance log at the time it was stopping and message
> #2373 was lost.
> The 2373_missing.png is a capture of Artemis console for the failover-queue
> destination, where it can be noticed 2372 & 2374 are consecutive.
> The producer log shows the 2374 first send is rolled-back, then retried as
> expected, but 2373 send appears successful.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]