[jira] [Created] (AMQ-7455) Blocking XA Periodic Recovery and Container Operations when deploying ActiveMQ RA 5.15.x to JBoss EAP 6.4

Markus (Jira) Thu, 26 Mar 2020 06:35:11 -0700

Markus created AMQ-7455:
---------------------------

             Summary: Blocking XA Periodic Recovery and Container Operations 
when deploying ActiveMQ RA 5.15.x to JBoss EAP 6.4
                 Key: AMQ-7455
                 URL: https://issues.apache.org/jira/browse/AMQ-7455
             Project: ActiveMQ
          Issue Type: Bug
          Components: RAR
    Affects Versions: 5.15.12, 5.15.11
            Reporter: Markus



*Scenario*: deploy ActiveMQ RA to EAP6.4 with XA enabled DataSources for JDBC 
and ActiveMQ Artemis as JMS broker in a failover config (active/passive) and 
run several applications that use JMS and JDBC datasources. (In our scenario, 
there are 4 XA JDBC DataSources and 42 deployed WAR files). ActiveMQ resource 
adapter is used as the Artemis one is not compatible to this EAP version.

JBoss EAP and Artemis are running on the same host in the examples below, but 
this does not matter. Artemis is running with example HA-by-network-replication 
settings on localhost:61616 and localhost:61617.

The resource adapter settings on JBoss EAP side and on client side (especially 
properties of the broker URIs for ActiveMQ's connection factory) are exactly as 
described in [Red Hat's reference documentation for HA 
config|https://access.redhat.com/documentation/en-us/red_hat_jboss_a-mq/6.1/html-single/integrating_with_jboss_enterprise_application_platform/index]:
 * Add {{?jms.rmIdFromConnectionId=true}} to the connection URI
 * {{<is-same-rm-override>false</is-same-rm-override>}} in standalone.xml

*Observed Symptoms*: On a regular basis, _undeployment of WAR files take 
minutes (around 10) instead of seconds_, blocking other container management 
operations during this timeframe. This behavior can be traced back to running 
ActiveMQ RA and Artemis as JMS broker. If running the same WAR files and same 
config, but using JBoss EAP's internal HornetQ broker, these delays do not 
occur and undeployments take mere seconds.

*Further Investigation*: This only affects JavaEE modules that define at least 
one MDB (message driven bean). During shutdown of the beans, the container 
management thread is blocked while de-registering the bean's XA Resource, see 
following stack trace taken while a simple module is being undeployed for 
reference:
{noformat}
"ServerService Thread Pool -- 277" #3709 prio=5 os_prio=0 
tid=0x000000001d6e0000 nid=0x4df4 in Object.wait() [0x000000007c41e000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at 
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.waitForScanState(XARecoveryModule.java:935)
        at 
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.removeXAResourceRecoveryHelper(XARecoveryModule.java:92)
        - locked <0x000000008162bd38> (a 
java.util.concurrent.atomic.AtomicInteger)
        at 
com.arjuna.ats.jbossatx.jta.RecoveryManagerService.removeXAResourceRecovery(RecoveryManagerService.java:117)
        at 
org.jboss.jca.core.tx.jbossts.XAResourceRecoveryRegistryImpl.removeXAResourceRecovery(XAResourceRecoveryRegistryImpl.java:63)
        at org.jboss.jca.core.rar.EndpointImpl.deactivate(EndpointImpl.java:226)
        at 
org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.deactivate(MessageDrivenComponent.java:224)
        at 
org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.stop(MessageDrivenComponent.java:195)
// and a couple of more elements
{noformat}
During these 10 minutes of blocking while the module is supposed to be 
undeployed, EAP's log is literally spammed with thousands of these lines:
{noformat}
11:19:43,640 INFO  [org.apache.activemq.transport.failover.FailoverTransport] 
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,643 INFO  [org.apache.activemq.transport.failover.FailoverTransport] 
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,647 INFO  [org.apache.activemq.transport.failover.FailoverTransport] 
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,657 INFO  [org.apache.activemq.transport.failover.FailoverTransport] 
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,661 INFO  [org.apache.activemq.transport.failover.FailoverTransport] 
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,664 INFO  [org.apache.activemq.transport.failover.FailoverTransport] 
(ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
{noformat}
Only after these lines stop to be logged the undeployment process continues.

*What happens during this time in ActiveMQ:* Obviously there are thousands of 
connections established to Artemis during this time. Tracing back the code path 
that leads to these calls points to EAP's Periodic XA Resource Recovery thread, 
see the following full stack trace (taken at the same point in time):
{noformat}
"Periodic Recovery" #96 prio=5 os_prio=0 tid=0x000000002484a000 nid=0x1b28 
waiting for monitor entry [0x0000000028dbe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.lang.Object.wait(Native Method)
        at 
org.apache.activemq.transport.failover.FailoverTransport.oneway(FailoverTransport.java:623)
        - locked <0x00000000cc69eeb8> (a java.lang.Object)
        at 
org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
        at 
org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
        at 
org.apache.activemq.transport.ResponseCorrelator.request(ResponseCorrelator.java:86)
        at 
org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1392)
        at 
org.apache.activemq.ActiveMQConnection.ensureConnectionInfoSent(ActiveMQConnection.java:1486)
        - locked <0x00000000cc69ff10> (a java.lang.Object)
        at 
org.apache.activemq.ActiveMQConnection.start(ActiveMQConnection.java:527)
        at 
org.apache.activemq.ra.ActiveMQResourceAdapter$2.newConnection(ActiveMQResourceAdapter.java:361)
        at 
org.apache.activemq.ra.ActiveMQResourceAdapter$2.getResourceManagerId(ActiveMQResourceAdapter.java:278)
        at 
org.apache.activemq.TransactionContext.isSameRM(TransactionContext.java:666)
        at 
org.apache.activemq.ra.ActiveMQResourceAdapter$2.isSameRM(ActiveMQResourceAdapter.java:261)
        at 
org.jboss.jca.core.tx.jbossts.XAResourceWrapperImpl.isSameRM(XAResourceWrapperImpl.java:159)
        at 
com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.isSameRM(RecoveryXids.java:143)
        at 
com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.updateIfEquivalentRM(RecoveryXids.java:215)
        at 
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.refreshXidScansForEquivalentXAResourceImpl(XARecoveryModule.java:796)
        at 
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass(XARecoveryModule.java:570)
        at 
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkFirstPass(XARecoveryModule.java:181)
        - locked <0x000000008162ba98> (a 
com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule)
        at 
com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:747)
        at 
com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:375){noformat}
>From this stack trace I want to point out especially this line: 
>{{org.apache.activemq.TransactionContext.isSameRM()}}

This method is part of the XA Resource Contract. Simply put the periodic 
recovery process iterates over XA Resources and looks for dormant XA 
transactions that need to be finished in some way or the other. Here it checks 
different combinations of XA Resources to asses whether they came from the same 
Resource Manager ({{.isSameRM()}}).
 During this recovery process (which in the stack trace above is still in the 
phase of gathering basic information and not actually working on XA 
transactions) the deregistration of the message driven beans is blocked.

*Analysis for the super long duration of recovery process*: The {{XAResource}} 
instances that periodic recovery works on are created by 
{{org.apache.activemq.ra.ActiveMQResourceAdapter.getXAResources(ActivationSpec[])}}.
 The anonymous class that extends {{org.apache.activemq.TransactionContext}} 
and that is instantiated here does basically the following steps for each 
invocation of any of the XA Resource contract methods:
 # Replace the current connection to Artemis with a freshly created one
 # Call {{*super*}}
 # Restore the current connection and close the new one

Specifically for {{isSameRM}} this results in the following chain of events:
 # Create a pristine connection to remote Artemis (in our case a 
FailoverTransport)
 # Call {{*super*.isSameRM(xaresource)}}
 # If {{xaresource}} is {{*null*}} or is not an instance of 
{{TransactionContext}}, return false 
({{org/apache/activemq/TransactionContext.java:658}})
 # Get the current resource manager id and {{xaresource}}'s resource manager 
id. This effectively calls {{*this*.getResourceManagerId()}}. As noted above, 
this leads to the following calls:
 # Replace the current connection to Artemis (established in step 1) with a 
second pristine one
 # Call {{*super*.getResourceManagerId()}}
 # Call {{getResourceManagerId()}} on the connection of step 5
 # In {{org/apache/activemq/ActiveMQConnection.java:1622}}, the resource 
manager id effectively is the Connection Id in our setup (according to the flag 
set in the connection URI).
 # This id is _unique for each connection_ (see 
{{org/apache/activemq/ActiveMQConnection.java:240}})
 # Compare the resulting value with the resource manager id of xaresource and 
return the result
 # Close the connection to the JMS broker established in step 5 and set the one 
established in step 1 as "current" connection
 # Close the connection established in step 1 and reset current connection to 
whatever was there before (in my understanding that is always {{*null*}} as the 
instances are always being freshly created inside 
{{ActiveMQResourceAdapter.getXAResources}})

This is repeated for each pair of XA Resources, or rather Activation Specs, 
that the recovery manager encounters. The log spamming stems from steps 1 and 5 
because each time a INFO message is logged tht tells us that a new connection 
has been established to the server.

This also happens when the recovery process wants to know if a JDBC connection 
to our SQL server (which we have plenty of due to several modules with a 
connection pooling data source each) is from the same resource manager as a JMS 
connection to Artemis - which it obviously never is.

*Problematic Aspects of this code*:
 # It doesn't make sense to establish pristine connections to the remote broker 
for each invocation of one of {{XAResource}}'s methods. The connection setup is 
expensive with a plain non-failover URI as it is, but even more so with a 
failover transport - this has been supported by tests and also the code shows 
this.
 # If there is the option to find an answer to {{isSameRM()}} without having to 
do any work, this should be done as early as possible.
 # If the connection factory is configured with {{rmIdFromConnectionId}} 
enabled, then {{isSameRM()}} in my understanding will never return {{*true*}}.

*Possible solutions for each of these aspects*:
 # In order to change the basic pattern of 
create-connection-then-do-something-then-close-connection that happens inside  
{{ActiveMQResourceAdapter.}}{{getXAResources}}{{()}}, a big portion of this 
code would propably have to be re-written.
 # In the current code base, the instanceof check (see step 3 above) can be 
extraced and inlined into the anonymous inner class defined inside 
{{ActiveMQResourceAdapter.}}{{getXAResources}}{{()}}
 # I don't know about the semantic correctness of this, i.e. if this behavior 
is desired; but if it is, and all the code I read strengthens this assumption, 
then in this case this is another candidate for short-circuting the expensive 
connection creation: if this config flag is set on the connection factory, 
directly return {{*false*}}.

So, basically, the solutions 2 and 3 somewhat mitigate the antipattern(?) 
described in 1. And only this aspect is addressed by the pull request that is 
linked to this ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (AMQ-7455) Blocking XA Periodic Recovery and Container Operations when deploying ActiveMQ RA 5.15.x to JBoss EAP 6.4

Reply via email to