[
https://issues.apache.org/jira/browse/ARTEMIS-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Clebert Suconic closed ARTEMIS-2958.
------------------------------------
Fix Version/s: 2.16.0
Resolution: Fixed
> Timed out waiting pool stop on backup restart
> ---------------------------------------------
>
> Key: ARTEMIS-2958
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2958
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker, JMX
> Reporter: Francesco Nigro
> Assignee: Francesco Nigro
> Priority: Minor
> Fix For: 2.16.0
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> The changes on ARTEMIS-2823 have caused the AMQ thread pool stop on server
> stop to be moved after {{callDeActiveCallbacks()}}, while the changes on
> ARTEMIS-2838 to
> {{server.getServer().getManagementService().unregisterHawtioSecurity()}} on
> {{callDeActiveCallbacks()}} have moved the HawtioSecurity de-registration to
> happen on server stop.
> it means that on server restart, if the thread pool has a slow stop, JMX
> won't be available until a new start.
> The thread pool stop can block stopping if there is a long task still
> running/blocked in the pool and the default strategy while stopping the
> broker is to await 10 seconds before forcing a shutdown of the pending tasks
> ie JMX can be unavailable for at least 10 seconds after
> {{callDeActiveCallbacks()}} and before a new (pre)activation will register it
> again.
> This stealthy behaviour has been captured by random failures on
> {{org.apache.activemq.artemis.tests.smoke.jmxfailback.JmxFailbackTest#testFailbackOnJMX}},
> due to some randomly blocked task in the thread pool.
> The core issue that cause the thread pool to be randomly blocked was present
> by long time, but the unavailability time window of JMX introduced by the
> mentioned JIRAs was the change that has triggered the bomb.
> The test check for 5 seconds the availability of backup JMX connection during
> a backup restart (on failback): given that the default thread pool await time
> is 10 seconds, a longer thread pool stop will make the test to fail.
> It seems by a thread dump inspection that the pending task is:
> {code:java}
> jmx-failback2-out:"Thread-1
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@1f97cf0d)"
> Id=43 TIMED_WAITING on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
> jmx-failback2-out: at sun.misc.Unsafe.park(Native Method)
> jmx-failback2-out: - waiting on
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
> jmx-failback2-out: at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> jmx-failback2-out: at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
> jmx-failback2-out: at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:504)
> jmx-failback2-out: - locked java.lang.Object@607e79a2
> jmx-failback2-out: at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
> jmx-failback2-out: at
> org.apache.activemq.artemis.core.server.cluster.ClusterControl.authorize(ClusterControl.java:80)
> jmx-failback2-out: at
> org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:271)
> jmx-failback2-out: at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
> jmx-failback2-out: at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
> jmx-failback2-out: at
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
> jmx-failback2-out: at
> org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$38/327040562.run(Unknown
> Source)
> jmx-failback2-out: at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> jmx-failback2-out: at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> jmx-failback2-out: at
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> jmx-failback2-out:
> jmx-failback2-out: Number of locked synchronizers = 1
> jmx-failback2-out: -
> java.util.concurrent.ThreadPoolExecutor$Worker@6e676dc8
> {code}
> And indeed it seems that both {{BackupManager::stop}} and
> {{BackupManager::activated}} are calling {{BackupConnector::close}} that's
> calling {{closeLocator(backupServerLocator)}} without unblocking
> {{clusterControl.authorize()}}.
> A possible fix would be to correctly unblock any blocking call on both cases,
> to clean stop {{BackupManager}} and let the broker thread pool to immediately
> stop.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)