[jira] [Updated] (ARTEMIS-2048) JCA RA does not failover to backup until TCP connect fails

Jozef Tomek (JIRA) Sat, 15 Sep 2018 09:28:08 -0700


     [ 
https://issues.apache.org/jira/browse/ARTEMIS-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jozef Tomek updated ARTEMIS-2048:
---------------------------------
    Description: 
In a cluster configured with HA replication and UDP broadcast discovery, when 
both master and backup have started properly and the *process for the master 
node is then suspended at the OS level (Windows)*, the Artemis JCA resource 
adapter does not recognize that the live broker is stuck and will not fail 
over to the backup until TCP connections to the master start being refused.

 

If the cluster connection on the nodes is configured with low enough timeouts, 
the backup node detects the problem in reasonable time and becomes live. The 
JCA RA, however, does not connect to the new live node for several minutes. 
This is because the calls to
{code:java}
1094: createConnector()

1096: openTransportConnection(liveConnector){code}
in 
{code:java}
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.createTransportConnection(){code}
 

do not return null (which would be the signal to attempt failover), and thus 
the attempt to communicate with the stuck master fails only later at
{code:java}
911: clientProtocolManager.checkForFailover(liveNodeID){code}
in 
{code:java}
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.getConnection(){code}
 

which causes errors when trying to use a connection to the broker (both 
explicit usage and MDBs).

 

Most of the time, the JCA adapter eventually recognizes that the live broker 
is gone, fails over, and everything starts working again.

Several times, with my (other) prototype app, I was however able to get the 
adapter stuck in such a way that, even though the slave (now live) was running 
just fine, either:
 * failover happened, but somehow not for MDBs: the app could explicitly 
publish messages (getting a new usable connection from the pool), but MDBs 
were no longer consuming from queues
 * failover did not happen at all, and neither publishing nor consuming 
worked anymore

However, I don't have reliable reproduction steps for this yet.

The theory about TCP connections is supported by running telnet against the 
suspended master's port. For several minutes after the suspend, telnet can 
connect just fine, and this changes exactly when I see messages in the server 
logs about failing over to the backup.
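
As a rough illustration of why a TCP connect can still succeed against a 
process that is not running, here is a minimal Java sketch. It is not part of 
the attached test app, and the class and method names are hypothetical. It 
assumes that the operating system completes the TCP handshake for a listening 
socket on its own, so a listener that never calls accept() is used as a 
stand-in for the suspended broker process; whether a suspended Windows process 
behaves exactly the same way is an assumption that merely matches the telnet 
observation above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class TcpProbe {

    /** Returns true if a TCP connection to host:port is established within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    /**
     * Opens a listener that never calls accept() (stand-in for a stuck process),
     * probes it, closes it, and probes the same port again.
     * Returns { connectable while listening, connectable after close }.
     */
    static boolean[] probeListenerThatNeverAccepts() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        String host = loopback.getHostAddress();
        int port;
        boolean whileListening;
        // Port 0 lets the OS pick a free port; 50 is the accept backlog.
        try (ServerSocket stuck = new ServerSocket(0, 50, loopback)) {
            port = stuck.getLocalPort();
            // Nobody calls stuck.accept(); the handshake is completed by the OS.
            whileListening = canConnect(host, port, 2000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The listener is closed now, so the port should no longer accept connections.
        boolean afterClose = canConnect(host, port, 2000);
        return new boolean[] { whileListening, afterClose };
    }

    public static void main(String[] args) {
        boolean[] result = probeListenerThatNeverAccepts();
        System.out.println("listening, never accepting: " + result[0]);
        System.out.println("listener closed:            " + result[1]);
    }
}
```

The point of the sketch is only that a successful TCP connect says nothing 
about whether the application behind the port is responsive, which would be 
consistent with the RA not treating the suspended master as failed.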

 

I've prepared a small test app with a REST API for publishing a message to a 
queue (use the included Swagger UI pages) and an MDB consuming from that queue.

At the link below you can find the source code of the app, scripts for 
creating the master and slave brokers locally, the parts of the broker.xml 
config files with the required config, and the resources required to set up 
Payara. There is also a patch tracking the changes I've made to the Artemis 
RA & RAR project code to make it run in Payara.

[https://drive.google.com/open?id=11DNBCLKfAwttfibDw0Ckm_mVVhXP2JiR]

(the app needs the "test.input" address+queue created beforehand, since the 
MDB consumer does not create it automatically; the app also sets the log level 
for "org.apache.activemq.artemis" to ALL)

  was:
In cluster configuration with HA replication and UDP broadcast discovery, when 
both master and backup are properly started and then process for master node is 
suspended on OS level (Windows), Artemis JCA resource adapter implementation 
does not properly recognize live being stuck and will not failover to backup 
until the moment when TCP connections to master will start to get refused.

 

If cluster connection on nodes is configured to use low enough timeouts, backup 
node is able to recognize the problem in meaningful time and become a live. JCA 
RA however will not connect to now new live for several minutes. It's because 
calls to
{code:java}
1094: createConnector()

1096: openTransportConnection(liveConnector){code}
in 
{code:java}
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.createTransportConnection(){code}
 

will not return null (which would be the signal to try to do failover) and thus 
attempt to communicate with stuck master will fail later at
{code:java}
911: clientProtocolManager.checkForFailover(liveNodeID){code}
in 
{code:java}
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.getConnection(){code}
 

which causes errors when trying to use connection to broker (both explicit 
usage and MDBs).

 

Most of the time, JCA adapter eventually recognizes live not being there, do a 
failover and everything starts working again.

Several times, with my (other) prototype app, I was however able to get adapter 
stuck in a way that, even though slave (now live) was running just fine, either:
 * failover happened but not for MDBs somehow - app could explicitly publish 
messages (get new usable connection from pool), but MDBs were not consuming 
from queues anymore
 * failover did not happen at all and both publishing and consuming was not 
working anymore

For this I however don't have reliable reproduction steps yet.

The theory about TCP connections is supported by doing telnets to suspended 
master's port. For several minutes after suspend, telnet can connect just fine 
and it changes exactly when I see messages in server logs about doing failover 
to backup.

 

I've prepared small test app, having REST api to publish message to a queue 
(use included Swagger UI pages) and MDB consuming from the queue.

On below link you can find source code of the app, scripts for creating master 
and slave brokers locally, parts of broker.xml config files with required 
config, resources required to setup Payara. Also patch tracking changes I've 
made to artemis RA & RAR projects code to make it to run in Payara

[https://drive.google.com/open?id=11DNBCLKfAwttfibDw0Ckm_mVVhXP2JiR]

(app needs "test.input" addesss+queue created beforehand, since MDB consumer 
does not create it automatically, and sets log level for 
"org.apache.activemq.artemis" to ALL)


> JCA RA does not failover to backup until TCP connect fails
> ----------------------------------------------------------
>
>                 Key: ARTEMIS-2048
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2048
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.6.2
>         Environment: Latest Payara (Full 182)
> JDK 1.8
> Windows machine (same on 7 x64 and 10 x64)
>            Reporter: Jozef Tomek
>            Priority: Major
>              Labels: Failover, HA, JCA, RAR
>             Fix For: 2.7.0, 2.6.4
>
>
> In a cluster configured with HA replication and UDP broadcast discovery, 
> when both master and backup have started properly and the *process for the 
> master node is then suspended at the OS level (Windows)*, the Artemis JCA 
> resource adapter does not recognize that the live broker is stuck and will 
> not fail over to the backup until TCP connections to the master start being 
> refused.
>  
> If the cluster connection on the nodes is configured with low enough 
> timeouts, the backup node detects the problem in reasonable time and becomes 
> live. The JCA RA, however, does not connect to the new live node for several 
> minutes. This is because the calls to
> {code:java}
> 1094: createConnector()
> 1096: openTransportConnection(liveConnector){code}
> in 
> {code:java}
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.createTransportConnection(){code}
>  
> do not return null (which would be the signal to attempt failover), and 
> thus the attempt to communicate with the stuck master fails only later at
> {code:java}
> 911: clientProtocolManager.checkForFailover(liveNodeID){code}
> in 
> {code:java}
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.getConnection(){code}
>  
> which causes errors when trying to use a connection to the broker (both 
> explicit usage and MDBs).
>  
> Most of the time, the JCA adapter eventually recognizes that the live broker 
> is gone, fails over, and everything starts working again.
> Several times, with my (other) prototype app, I was however able to get the 
> adapter stuck in such a way that, even though the slave (now live) was 
> running just fine, either:
>  * failover happened, but somehow not for MDBs: the app could explicitly 
> publish messages (getting a new usable connection from the pool), but MDBs 
> were no longer consuming from queues
>  * failover did not happen at all, and neither publishing nor consuming 
> worked anymore
> However, I don't have reliable reproduction steps for this yet.
> The theory about TCP connections is supported by running telnet against the 
> suspended master's port. For several minutes after the suspend, telnet can 
> connect just fine, and this changes exactly when I see messages in the 
> server logs about failing over to the backup.
>  
> I've prepared a small test app with a REST API for publishing a message to a 
> queue (use the included Swagger UI pages) and an MDB consuming from that 
> queue.
> At the link below you can find the source code of the app, scripts for 
> creating the master and slave brokers locally, the parts of the broker.xml 
> config files with the required config, and the resources required to set up 
> Payara. There is also a patch tracking the changes I've made to the Artemis 
> RA & RAR project code to make it run in Payara.
> [https://drive.google.com/open?id=11DNBCLKfAwttfibDw0Ckm_mVVhXP2JiR]
> (the app needs the "test.input" address+queue created beforehand, since the 
> MDB consumer does not create it automatically; the app also sets the log 
> level for "org.apache.activemq.artemis" to ALL)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
