[
https://issues.apache.org/jira/browse/ARTEMIS-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jozef Tomek updated ARTEMIS-2048:
---------------------------------
Affects Version/s: 2.6.2
> JCA RA does not failover to backup until TCP connect fails
> ----------------------------------------------------------
>
> Key: ARTEMIS-2048
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2048
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.6.2
> Environment: Artemis 2.6.2 brokers and JCA RAR
> Latest Payara (Full 182)
> JDK 1.8
> Windows machine (same on 7 x64 and 10 x64)
> Reporter: Jozef Tomek
> Priority: Major
> Labels: Failover, HA, JCA, RAR
>
> In a cluster configuration with HA replication and UDP broadcast discovery,
> when both master and backup are started properly and the master node's
> process is then suspended at the OS level (Windows), the Artemis JCA resource
> adapter does not recognize that the live broker is stuck and does not fail
> over to the backup until TCP connections to the master start being refused.
>
> If the cluster connection on the nodes is configured with low enough
> timeouts, the backup node recognizes the problem in reasonable time and
> becomes live. The JCA RA, however, does not connect to the new live broker
> for several minutes. This is because the calls to
> {code:java}
> createConnector()
> openTransportConnection(liveConnector){code}
> in
>
> {code:java}
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.createTransportConnection(){code}
>
> do not return null (which would be the signal to attempt failover), so the
> attempt to communicate with the stuck master fails only later (at
> clientProtocolManager.checkForFailover(liveNodeID)) in
>
> {code:java}
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.getConnection(){code}
>
> which causes errors when trying to use a connection to the broker (both
> explicit usage and MDBs).
>
> Most of the time, the JCA adapter eventually recognizes that the live broker
> is gone, fails over, and everything starts working again.
> Several times, however, with my (other) prototype app, I was able to get the
> adapter stuck in such a way that, even though the slave (now live) was
> running just fine, either:
> * failover happened, but somehow not for MDBs: the app could explicitly
> publish messages (get a new usable connection from the pool), but MDBs were
> no longer consuming from queues
> * failover did not happen at all, and neither publishing nor consuming worked
> anymore
> I don't have reliable reproduction steps for this yet, however.
> The theory about TCP connections is supported by running telnet against the
> suspended master's port. For several minutes after the suspend, telnet can
> still connect just fine; this changes exactly when the server logs show
> messages about failing over to the backup.
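
The telnet check described above could also be scripted. The following is a rough sketch, assuming host and port are passed on the command line; `PortProbe` and its `Result` enum are hypothetical names introduced here. It reports each change in the connect outcome with a timestamp so it can be compared against the broker logs.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.time.LocalTime;

// Hypothetical helper, not part of Artemis or the attached test app.
class PortProbe {

    enum Result { CONNECTED, REFUSED, TIMED_OUT, OTHER_ERROR }

    // Attempts a single TCP connect and classifies the outcome.
    static Result probe(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return Result.CONNECTED;
        } catch (ConnectException e) {
            // Typically "connection refused": nothing is listening on the port.
            return Result.REFUSED;
        } catch (SocketTimeoutException e) {
            return Result.TIMED_OUT;
        } catch (IOException e) {
            return Result.OTHER_ERROR;
        }
    }

    // Usage: java PortProbe <host> <port>; runs until interrupted.
    public static void main(String[] args) throws InterruptedException {
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        Result last = null;
        while (true) {
            Result current = probe(host, port, 2000);
            if (current != last) {
                // Print only state changes, with a timestamp.
                System.out.println(LocalTime.now() + " " + current);
                last = current;
            }
            Thread.sleep(1000);
        }
    }
}
```

Like telnet, this only exercises the TCP handshake, so a CONNECTED result says nothing about whether the process behind the port is actually running.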
>
> I've prepared a small test app with a REST API to publish a message to a
> queue (use the included Swagger UI pages) and an MDB consuming from the queue.
> At the link below you can find the source code of the app, scripts for
> creating master and slave brokers locally, the parts of the broker.xml config
> files with the required config, and the resources required to set up Payara.
> There is also a patch tracking the changes I've made to the Artemis RA & RAR
> projects' code to make it run in Payara:
> https://drive.google.com/open?id=11DNBCLKfAwttfibDw0Ckm_mVVhXP2JiR
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)