[
https://issues.apache.org/jira/browse/ARTEMIS-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jozef Tomek updated ARTEMIS-2048:
---------------------------------
Description:
In a cluster configuration with HA replication and UDP broadcast discovery, when
both master and backup are properly started and then the *process for the master
node is suspended at the OS level (Windows)*, the Artemis JCA resource adapter
implementation does not recognize that the live is stuck and will not
fail over to the backup until TCP connections to the master start
to get refused.
If the cluster connection on the nodes is configured with low enough timeouts, the
backup node detects the problem in reasonable time and becomes live. The JCA
RA, however, will not connect to the new live for several minutes. This is because
the calls to
{code:java}
1094: createConnector()
1096: openTransportConnection(liveConnector){code}
in
{code:java}
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.createTransportConnection(){code}
do not return null (which would be the signal to attempt failover), so the
attempt to communicate with the stuck master fails later at
{code:java}
911: clientProtocolManager.checkForFailover(liveNodeID){code}
in
{code:java}
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.getConnection(){code}
which causes errors when trying to use a connection to the broker (both explicit
usage and MDBs).
Most of the time, the JCA adapter eventually recognizes that the live is gone, does a
failover, and everything starts working again.
Several times, however, with my (other) prototype app, I was able to get the adapter
stuck in a way that, even though the slave (now live) was running just fine, either:
* failover happened but somehow not for MDBs: the app could explicitly publish
messages (getting a new usable connection from the pool), but MDBs were no longer
consuming from queues
* failover did not happen at all and neither publishing nor consuming
worked anymore
I don't have reliable reproduction steps for this yet.
The theory about TCP connections is supported by running telnet against the
suspended master's port. For several minutes after the suspend, telnet can still
connect just fine; this changes exactly when I see messages in the server logs
about failing over to the backup.
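To illustrate the telnet observation, here is a minimal sketch, not Artemis code. It assumes that a listening socket whose owner never calls accept() behaves like the suspended master: the OS still completes the TCP handshake, so a bare connect succeeds, while a check that waits for data from the peer times out. The class name SuspendedPeerProbe and both helper methods are hypothetical names introduced only for this illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class SuspendedPeerProbe {

    /** Returns true if the TCP handshake completes within timeoutMs. */
    public static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Returns true only if the peer sends at least one byte within timeoutMs. */
    public static boolean respondsWithin(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // bound the blocking read below
            return socket.getInputStream().read() != -1;
        } catch (IOException e) { // includes SocketTimeoutException
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a listener that never accepts stands in for the suspended master.
        try (ServerSocket stuck = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            String host = stuck.getInetAddress().getHostAddress();
            int port = stuck.getLocalPort();
            System.out.println("connect succeeded: " + canConnect(host, port, 1000));
            System.out.println("peer responded:    " + respondsWithin(host, port, 500));
        }
    }
}
```

If this assumption matches what happens with the suspended broker process, it would explain why a successful connect alone is not a usable liveness signal for the resource adapter.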
I've prepared a small test app with a REST API to publish a message to a queue
(use the included Swagger UI pages) and an MDB consuming from the queue.
At the link below you can find the source code of the app, scripts for creating
the master and slave brokers locally, parts of the broker.xml config files with
the required config, and the resources required to set up Payara. There is also a
patch tracking the changes I've made to the Artemis RA & RAR project code to make
it run in Payara.
[https://drive.google.com/open?id=11DNBCLKfAwttfibDw0Ckm_mVVhXP2JiR]
(the app needs the "test.input" address+queue created beforehand, since the MDB
consumer does not create it automatically, and it sets the log level for
"org.apache.activemq.artemis" to ALL)
> JCA RA does not failover to backup until TCP connect fails
> ----------------------------------------------------------
>
> Key: ARTEMIS-2048
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2048
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.6.2
> Environment: Latest Payara (Full 182)
> JDK 1.8
> Windows machine (same on 7 x64 and 10 x64)
> Reporter: Jozef Tomek
> Priority: Major
> Labels: Failover, HA, JCA, RAR
> Fix For: 2.7.0, 2.6.4
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)