[jira] [Resolved] (ARTEMIS-2048) JCA RA does not failover to backup until TCP connect fails

Justin Bertram (Jira) Fri, 14 Nov 2025 12:42:46 -0800


     [ 
https://issues.apache.org/jira/browse/ARTEMIS-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Justin Bertram resolved ARTEMIS-2048.
-------------------------------------
    Resolution: Not A Bug

Suspending a process at the OS level is a unique operation that doesn't mimic a 
real-world use-case (e.g. a hardware or software crash), and it's not something 
I would expect an administrator to do during normal operation. Therefore, I 
don't believe that issues found for this use-case merit real investigation. If 
you can demonstrate an issue with a real-world use-case please re-open this 
issue with an explanation and steps to reproduce. Thanks!

> JCA RA does not failover to backup until TCP connect fails
> ----------------------------------------------------------
>
>                 Key: ARTEMIS-2048
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2048
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.6.2
>         Environment: Latest Payara (Full 182)
> JDK 1.8
> Windows machine (same on 7 x64 and 10 x64)
>            Reporter: Jozef Tomek
>            Priority: Major
>              Labels: Failover, HA, JCA, RAR
>
> In a cluster configuration with HA replication and UDP broadcast discovery, 
> when both master and backup are properly started and then *the process for the 
> master node is suspended at the OS level (Windows)*, the Artemis JCA resource 
> adapter does not recognize that the live is stuck and will not fail over to 
> the backup until TCP connections to the master start to get refused.
>  
> If the cluster connection on the nodes is configured with low enough timeouts, 
> the backup node is able to recognize the problem in reasonable time and become 
> live. The JCA RA, however, will not connect to the new live for several minutes. 
> It's because calls to
> {code:java}
> 1094: createConnector()
> 1096: openTransportConnection(liveConnector){code}
> in 
> {code:java}
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.createTransportConnection(){code}
>  
> will not return null (which would be the signal to attempt failover), and 
> thus the attempt to communicate with the stuck master fails later at
> {code:java}
> 911: clientProtocolManager.checkForFailover(liveNodeID){code}
> in 
> {code:java}
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.getConnection(){code}
>  
> which causes errors when trying to use a connection to the broker (both 
> explicit usage and MDBs).
>  
> Most of the time, the JCA adapter eventually recognizes that the live is gone, 
> fails over, and everything starts working again.
> Several times, with my (other) prototype app, I was however able to get the 
> adapter stuck in a way that, even though the slave (now live) was running just 
> fine, either:
>  * failover happened but not for MDBs somehow - app could explicitly publish 
> messages (get new usable connection from pool), but MDBs were not consuming 
> from queues anymore
>  * failover did not happen at all and both publishing and consuming were no 
> longer working
> For this, however, I don't have reliable reproduction steps yet.
> The theory about TCP connections is supported by telnetting to the suspended 
> master's port. For several minutes after the suspend, telnet can connect just 
> fine, and that changes exactly when I see messages in the server logs about 
> failing over to the backup.
>  
> I've prepared a small test app with a REST API to publish a message to a queue 
> (use the included Swagger UI pages) and an MDB consuming from the queue.
> At the link below you can find the source code of the app, scripts for creating 
> master and slave brokers locally, the parts of the broker.xml config files with 
> the required config, and the resources required to set up Payara, as well as a 
> patch tracking the changes I've made to the Artemis RA & RAR project code to 
> make it run in Payara:
> [https://drive.google.com/open?id=11DNBCLKfAwttfibDw0Ckm_mVVhXP2JiR]
> (the app needs the "test.input" address+queue created beforehand, since the 
> MDB consumer does not create it automatically, and it sets the log level for 
> "org.apache.activemq.artemis" to ALL)
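
The telnet observation in the report can be illustrated in isolation: a
successful TCP connect only shows that the operating system completed the
handshake, not that the process behind the port is responsive. Below is a
minimal sketch of that behavior, assuming a local listener that never calls
accept() as a stand-in for an unresponsive broker. The class and method names
(StuckListenerDemo, probe, probeUnresponsiveListener) are hypothetical and are
not part of Artemis, and the sketch does not reproduce the Windows process
suspension itself.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class StuckListenerDemo {

    /** Outcome of a probe: did the TCP connect succeed, and did the peer send anything? */
    public static final class Probe {
        public final boolean connected;
        public final boolean answered;

        Probe(boolean connected, boolean answered) {
            this.connected = connected;
            this.answered = answered;
        }
    }

    /** Connects to host:port, then waits up to readTimeoutMs for a single byte from the peer. */
    public static Probe probe(String host, int port, int connectTimeoutMs, int readTimeoutMs) {
        boolean connected = false;
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            connected = true;
            socket.setSoTimeout(readTimeoutMs);
            // read() returns -1 on end of stream and throws SocketTimeoutException on timeout.
            return new Probe(true, socket.getInputStream().read() != -1);
        } catch (IOException e) {
            // Either the connect failed (connected is still false) or the read timed out or failed.
            return new Probe(connected, false);
        }
    }

    /**
     * Opens a listening socket on the loopback interface that never calls accept(),
     * which stands in for a process that holds a port open but does no work.
     */
    public static Probe probeUnresponsiveListener() throws IOException {
        try (ServerSocket listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            return probe(listener.getInetAddress().getHostAddress(),
                         listener.getLocalPort(), 1000, 500);
        }
    }

    public static void main(String[] args) throws IOException {
        Probe p = probeUnresponsiveListener();
        System.out.println("connected=" + p.connected + " answered=" + p.answered);
    }
}
```

In this sketch the connect is expected to succeed while the read times out,
which matches the report that telnet could still connect to the suspended
master. It suggests why a client that relies only on connect failures may
notice a stuck peer late, whereas a check that waits for an application-level
response with a timeout would notice it sooner.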



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact
