Markus created AMQ-7455:
---------------------------
Summary: Blocking XA Periodic Recovery and Container Operations
when deploying ActiveMQ RA 5.15.x to JBoss EAP 6.4
Key: AMQ-7455
URL: https://issues.apache.org/jira/browse/AMQ-7455
Project: ActiveMQ
Issue Type: Bug
Components: RAR
Affects Versions: 5.15.12, 5.15.11
Reporter: Markus
*Scenario*: deploy ActiveMQ RA to EAP6.4 with XA enabled DataSources for JDBC
and ActiveMQ Artemis as JMS broker in a failover config (active/passive) and
run several applications that use JMS and JDBC datasources. (In our scenario,
there are 4 XA JDBC DataSources and 42 deployed WAR files). ActiveMQ resource
adapter is used as the Artemis one is not compatible to this EAP version.
JBoss EAP and Artemis are running on the same host in the examples below, but
this does not matter. Artemis is running with example HA-by-network-replication
settings on localhost:61616 and localhost:61617.
The resource adapter settings on JBoss EAP side and on client side (especially
properties of the broker URIs for ActiveMQ's connection factory) are exactly as
described in [Red Hat's reference documentation for HA
config|https://access.redhat.com/documentation/en-us/red_hat_jboss_a-mq/6.1/html-single/integrating_with_jboss_enterprise_application_platform/index]:
* Add {{?jms.rmIdFromConnectionId=true}} to the connection URI
* {{<is-same-rm-override>false</is-same-rm-override>}} in standalone.xml
*Observed Symptoms*: On a regular basis, _undeployment of WAR files take
minutes (around 10) instead of seconds_, blocking other container management
operations during this timeframe. This behavior can be traced back to running
ActiveMQ RA and Artemis as JMS broker. If running the same WAR files and same
config, but using JBoss EAP's internal HornetQ broker, these delays do not
occur and undeployments take mere seconds.
*Further Investigation*: This only affects JavaEE modules that define at least
one MDB (message driven bean). During shutdown of the beans, the container
management thread is blocked while de-registering the bean's XA Resource, see
following stack trace taken while a simple module is being undeployed for
reference:
{noformat}
"ServerService Thread Pool -- 277" #3709 prio=5 os_prio=0
tid=0x000000001d6e0000 nid=0x4df4 in Object.wait() [0x000000007c41e000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.waitForScanState(XARecoveryModule.java:935)
at
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.removeXAResourceRecoveryHelper(XARecoveryModule.java:92)
- locked <0x000000008162bd38> (a
java.util.concurrent.atomic.AtomicInteger)
at
com.arjuna.ats.jbossatx.jta.RecoveryManagerService.removeXAResourceRecovery(RecoveryManagerService.java:117)
at
org.jboss.jca.core.tx.jbossts.XAResourceRecoveryRegistryImpl.removeXAResourceRecovery(XAResourceRecoveryRegistryImpl.java:63)
at org.jboss.jca.core.rar.EndpointImpl.deactivate(EndpointImpl.java:226)
at
org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.deactivate(MessageDrivenComponent.java:224)
at
org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.stop(MessageDrivenComponent.java:195)
// and a couple of more elements
{noformat}
During these 10 minutes of blocking while the module is supposed to be
undeployed, EAP's log is literally spammed with thousands of these lines:
{noformat}
11:19:43,640 INFO [org.apache.activemq.transport.failover.FailoverTransport]
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,643 INFO [org.apache.activemq.transport.failover.FailoverTransport]
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,647 INFO [org.apache.activemq.transport.failover.FailoverTransport]
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,657 INFO [org.apache.activemq.transport.failover.FailoverTransport]
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,661 INFO [org.apache.activemq.transport.failover.FailoverTransport]
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,664 INFO [org.apache.activemq.transport.failover.FailoverTransport]
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
{noformat}
Only after these lines stop to be logged the undeployment process continues.
*What happens during this time in ActiveMQ:* Obviously there are thousands of
connections established to Artemis during this time. Tracing back the code path
that leads to these calls points to EAP's Periodic XA Resource Recovery thread,
see the following full stack trace (taken at the same point in time):
{noformat}
"Periodic Recovery" #96 prio=5 os_prio=0 tid=0x000000002484a000 nid=0x1b28
waiting for monitor entry [0x0000000028dbe000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Object.wait(Native Method)
at
org.apache.activemq.transport.failover.FailoverTransport.oneway(FailoverTransport.java:623)
- locked <0x00000000cc69eeb8> (a java.lang.Object)
at
org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
at
org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
at
org.apache.activemq.transport.ResponseCorrelator.request(ResponseCorrelator.java:86)
at
org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1392)
at
org.apache.activemq.ActiveMQConnection.ensureConnectionInfoSent(ActiveMQConnection.java:1486)
- locked <0x00000000cc69ff10> (a java.lang.Object)
at
org.apache.activemq.ActiveMQConnection.start(ActiveMQConnection.java:527)
at
org.apache.activemq.ra.ActiveMQResourceAdapter$2.newConnection(ActiveMQResourceAdapter.java:361)
at
org.apache.activemq.ra.ActiveMQResourceAdapter$2.getResourceManagerId(ActiveMQResourceAdapter.java:278)
at
org.apache.activemq.TransactionContext.isSameRM(TransactionContext.java:666)
at
org.apache.activemq.ra.ActiveMQResourceAdapter$2.isSameRM(ActiveMQResourceAdapter.java:261)
at
org.jboss.jca.core.tx.jbossts.XAResourceWrapperImpl.isSameRM(XAResourceWrapperImpl.java:159)
at
com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.isSameRM(RecoveryXids.java:143)
at
com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.updateIfEquivalentRM(RecoveryXids.java:215)
at
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.refreshXidScansForEquivalentXAResourceImpl(XARecoveryModule.java:796)
at
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass(XARecoveryModule.java:570)
at
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkFirstPass(XARecoveryModule.java:181)
- locked <0x000000008162ba98> (a
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule)
at
com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:747)
at
com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:375){noformat}
>From this stack trace I want to point out especially this line:
>{{org.apache.activemq.TransactionContext.isSameRM()}}
This method is part of the XA Resource Contract. Simply put the periodic
recovery process iterates over XA Resources and looks for dormant XA
transactions that need to be finished in some way or the other. Here it checks
different combinations of XA Resources to asses whether they came from the same
Resource Manager ({{.isSameRM()}}).
During this recovery process (which in the stack trace above is still in the
phase of gathering basic information and not actually working on XA
transactions) the deregistration of the message driven beans is blocked.
*Analysis for the super long duration of recovery process*: The {{XAResource}}
instances that periodic recovery works on are created by
{{org.apache.activemq.ra.ActiveMQResourceAdapter.getXAResources(ActivationSpec[])}}.
The anonymous class that extends {{org.apache.activemq.TransactionContext}}
and that is instantiated here does basically the following steps for each
invocation of any of the XA Resource contract methods:
# Replace the current connection to Artemis with a freshly created one
# Call {{*super*}}
# Restore the current connection and close the new one
Specifically for {{isSameRM}} this results in the following chain of events:
# Create a pristine connection to remote Artemis (in our case a
FailoverTransport)
# Call {{*super*.isSameRM(xaresource)}}
# If {{xaresource}} is {{*null*}} or is not an instance of
{{TransactionContext}}, return false
({{org/apache/activemq/TransactionContext.java:658}})
# Get the current resource manager id and {{xaresource}}'s resource manager
id. This effectively calls {{*this*.getResourceManagerId()}}. As noted above,
this leads to the following calls:
# Replace the current connection to Artemis (established in step 1) with a
second pristine one
# Call {{*super*.getResourceManagerId()}}
# Call {{getResourceManagerId()}} on the connection of step 5
# In {{org/apache/activemq/ActiveMQConnection.java:1622}}, the resource
manager id effectively is the Connection Id in our setup (according to the flag
set in the connection URI).
# This id is _unique for each connection_ (see
{{org/apache/activemq/ActiveMQConnection.java:240}})
# Compare the resulting value with the resource manager id of xaresource and
return the result
# Close the connection to the JMS broker established in step 5 and set the one
established in step 1 as "current" connection
# Close the connection established in step 1 and reset current connection to
whatever was there before (in my understanding that is always {{*null*}} as the
instances are always being freshly created inside
{{ActiveMQResourceAdapter.getXAResources}})
This is repeated for each pair of XA Resources, or rather Activation Specs,
that the recovery manager encounters. The log spamming stems from steps 1 and 5
because each time a INFO message is logged tht tells us that a new connection
has been established to the server.
This also happens when the recovery process wants to know if a JDBC connection
to our SQL server (which we have plenty of due to several modules with a
connection pooling data source each) is from the same resource manager as a JMS
connection to Artemis - which it obviously never is.
*Problematic Aspects of this code*:
# It doesn't make sense to establish pristine connections to the remote broker
for each invocation of one of {{XAResource}}'s methods. The connection setup is
expensive with a plain non-failover URI as it is, but even more so with a
failover transport - this has been supported by tests and also the code shows
this.
# If there is the option to find an answer to {{isSameRM()}} without having to
do any work, this should be done as early as possible.
# If the connection factory is configured with {{rmIdFromConnectionId}}
enabled, then {{isSameRM()}} in my understanding will never return {{*true*}}.
*Possible solutions for each of these aspects*:
# In order to change the basic pattern of
create-connection-then-do-something-then-close-connection that happens insideĀ
{{ActiveMQResourceAdapter.}}{{getXAResources}}{{()}}, a big portion of this
code would propably have to be re-written.
# In the current code base, the instanceof check (see step 3 above) can be
extraced and inlined into the anonymous inner class defined inside
{{ActiveMQResourceAdapter.}}{{getXAResources}}{{()}}
# I don't know about the semantic correctness of this, i.e. if this behavior
is desired; but if it is, and all the code I read strengthens this assumption,
then in this case this is another candidate for short-circuting the expensive
connection creation: if this config flag is set on the connection factory,
directly return {{*false*}}.
So, basically, the solutions 2 and 3 somewhat mitigate the antipattern(?)
described in 1. And only this aspect is addressed by the pull request that is
linked to this ticket.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)