[ https://issues.apache.org/jira/browse/ARTEMIS-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Francesco Nigro updated ARTEMIS-2958: ------------------------------------- Summary: Timed out waiting pool stop on backup restart (was: Timed out waiting for pool slow down backup restart on failback) > Timed out waiting pool stop on backup restart > --------------------------------------------- > > Key: ARTEMIS-2958 > URL: https://issues.apache.org/jira/browse/ARTEMIS-2958 > Project: ActiveMQ Artemis > Issue Type: Bug > Components: Broker, JMX > Reporter: Francesco Nigro > Assignee: Francesco Nigro > Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > The changes on ARTEMIS-2823 have caused to await the AMQ thread pool on > server stop to be moved after {{callDeActiveCallbacks()}} while the changes > on ARTEMIS-2838 to > {{server.getServer().getManagementService().unregisterHawtioSecurity()}} on > {{callDeActiveCallbacks()}} have moved the HawtioSecurity de-registration to > happen on server stop. > it means that on server restart, if the thread pool has a slow stop, JMX > won't be available until a new start. > A slow thread pool stop can happen if there is any long task still > running/blocked in the pool: the default strategy while stopping the broker > is to await 10 seconds before force a shutdown of the pending tasks ie JMX > can be unavailable for at least 10 seconds after {{callDeActiveCallbacks()}} > and before a new (pre)activation will register it again. > This stealthy behaviour has been captured by random failures on > {{org.apache.activemq.artemis.tests.smoke.jmxfailback.JmxFailbackTest#testFailbackOnJMX}}, > due to some randomly blocked task in the thread pool. > The core issue that cause the thread pool to be randomly blocked was present > by long time, but the unavailability time window of JMX introduced by the > mentioned JIRAs was the change that has triggered the bomb. > The test check for 5 seconds the availability of backup JMX connection during > a backup restart (on failback): given that the default thread pool await time > is 10 seconds, a longer thread pool stop will make the test to fail. > It seems, by a thread dump inspection that the pending task is: > {code:java} > jmx-failback2-out:"Thread-1 > (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@1f97cf0d)" > Id=43 TIMED_WAITING on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1 > jmx-failback2-out: at sun.misc.Unsafe.park(Native Method) > jmx-failback2-out: - waiting on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1 > jmx-failback2-out: at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > jmx-failback2-out: at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163) > jmx-failback2-out: at > org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:504) > jmx-failback2-out: - locked java.lang.Object@607e79a2 > jmx-failback2-out: at > org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434) > jmx-failback2-out: at > org.apache.activemq.artemis.core.server.cluster.ClusterControl.authorize(ClusterControl.java:80) > jmx-failback2-out: at > org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:271) > jmx-failback2-out: at > org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) > jmx-failback2-out: at > org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) > jmx-failback2-out: at > org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) > jmx-failback2-out: at > org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$38/327040562.run(Unknown > Source) > jmx-failback2-out: at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > jmx-failback2-out: at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > jmx-failback2-out: at > org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) > jmx-failback2-out: > jmx-failback2-out: Number of locked synchronizers = 1 > jmx-failback2-out: - > java.util.concurrent.ThreadPoolExecutor$Worker@6e676dc8 > {code} > And indeed it seems that {{BackupManager::stop}} that's calling > {{BackupConnector::close}} just {{closeLocator(backupServerLocator)}} that > won't unblock {{clusterControl.authorize()}}. > A first fix would be to correctly unblock any blocking call to clean stop > {{BackupManager}} and let the broker thread pool to immediately stop. -- This message was sent by Atlassian Jira (v8.3.4#803005)