[ 
https://issues.apache.org/jira/browse/ARTEMIS-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jozef Tomek updated ARTEMIS-2048:
---------------------------------
    Affects Version/s: 2.6.2

> JCA RA does not failover to backup until TCP connect fails
> ----------------------------------------------------------
>
>                 Key: ARTEMIS-2048
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2048
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.6.2
>         Environment: Artemis 2.6.2 brokers and JCA RAR
> Latest Payara (Full 182)
> JDK 1.8
> Windows machine (same on 7 x64 and 10 x64)
>            Reporter: Jozef Tomek
>            Priority: Major
>              Labels: Failover, HA, JCA, RAR
>
> In a cluster configured with HA replication and UDP broadcast discovery, 
> when both master and backup have started properly and the master node's 
> process is then suspended at the OS level (Windows), the Artemis JCA resource 
> adapter does not recognize that the live broker is stuck and does not fail 
> over to the backup until TCP connections to the master start being refused.
>  
> If the cluster connection on the nodes is configured with low enough 
> timeouts, the backup node recognizes the problem within a reasonable time and 
> becomes live. The JCA RA, however, does not connect to the new live broker 
> for several minutes. This is because the calls to
> {code:java}
> 1078: createConnector()
> 1079:
> 1080: openTransportConnection(liveConnector){code}
> in
>  
> {code:java}
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.createTransportConnection(){code}
>  
> do not return null (which would be the signal to attempt failover), so the 
> attempt to communicate with the stuck master only fails later (at 
> clientProtocolManager.checkForFailover(liveNodeID)) in
>  
> {code:java}
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.getConnection(){code}
>  
> which causes errors when trying to use a connection to the broker (both 
> explicit usage and MDBs).
>  
> Most of the time, the JCA adapter eventually recognizes that the live broker 
> is gone, fails over, and everything starts working again.
> Several times, however, with my (other) prototype app, I was able to get the 
> adapter stuck in such a way that, even though the slave (now live) was 
> running just fine, either:
>  * failover happened, but somehow not for MDBs: the app could explicitly 
> publish messages (get a new usable connection from the pool), but MDBs were 
> no longer consuming from queues
>  * failover did not happen at all, and neither publishing nor consuming 
> worked anymore
> I don't have reliable reproduction steps for this yet, however.
> The theory about TCP connections is supported by telnetting to the suspended 
> master's port. For several minutes after the suspend, telnet can still 
> connect just fine, and this changes exactly when messages about failing over 
> to the backup appear in the server logs.
>  
> I've prepared a small test app with a REST API for publishing a message to a 
> queue (use the included Swagger UI pages) and an MDB consuming from the queue.
> At the link below you can find the source code of the app, scripts for 
> creating the master and slave brokers locally, the parts of the broker.xml 
> config files with the required config, and the resources required to set up 
> Payara. There is also a patch tracking the changes I made to the Artemis RA 
> and RAR projects' code to make it run in Payara:
> https://drive.google.com/open?id=11DNBCLKfAwttfibDw0Ckm_mVVhXP2JiR
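
The telnet observation in the description (a suspended process whose port still accepts TCP connections) can be reproduced in isolation. The sketch below is not Artemis code and does not speak the Artemis protocol. It assumes, for illustration only, a listener that never calls accept(), which stands in for the suspended broker, and a hypothetical probe that sends one byte and waits for any reply. The class name StuckPeerProbe, the Result values, and the timeouts are all made up for this example. It shows why a successful TCP connect alone cannot be taken as proof that the peer is alive.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class StuckPeerProbe {

    /** Hypothetical classification of a peer, used only in this sketch. */
    public enum Result { UNREACHABLE, ACCEPTED_BUT_SILENT, RESPONSIVE }

    /**
     * Connects to host:port, sends one byte and waits up to readTimeoutMs
     * for any reply byte. The one-byte exchange is an invented stand-in
     * for a real protocol-level ping.
     */
    public static Result probe(String host, int port, int connectTimeoutMs, int readTimeoutMs) {
        try (Socket socket = new Socket()) {
            try {
                socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            } catch (IOException e) {
                // Connection refused or connect timed out: the peer is visibly gone.
                return Result.UNREACHABLE;
            }
            // The TCP handshake succeeded, but that only proves the OS is listening.
            socket.setSoTimeout(readTimeoutMs);
            socket.getOutputStream().write(1);
            socket.getOutputStream().flush();
            int reply = socket.getInputStream().read();
            return reply >= 0 ? Result.RESPONSIVE : Result.ACCEPTED_BUT_SILENT;
        } catch (IOException e) {
            // Read timeout (SocketTimeoutException) or another I/O error after connecting.
            return Result.ACCEPTED_BUT_SILENT;
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // This listener never calls accept(), imitating a process that is suspended
        // while the OS keeps completing TCP handshakes on its behalf.
        try (ServerSocket silent = new ServerSocket(0, 50, loopback)) {
            Result result = probe(loopback.getHostAddress(), silent.getLocalPort(), 1000, 500);
            System.out.println("silent listener: " + result);
        }
    }
}
```

In this sketch the connect step succeeds against the silent listener and only the application-level wait times out, which matches the behaviour reported above: telnet connects fine while the broker process is suspended. Whether the eventual connection refusals on Windows come from a filled accept backlog or from something else is not established by this example.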



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
