[ 
https://issues.apache.org/jira/browse/ARTEMIS-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francesco Nigro updated ARTEMIS-2958:
-------------------------------------
    Summary: Timed out waiting pool stop on backup restart  (was: Timed out 
waiting for pool slow down backup restart on failback)

> Timed out waiting pool stop on backup restart
> ---------------------------------------------
>
>                 Key: ARTEMIS-2958
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2958
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker, JMX
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Minor
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The changes in ARTEMIS-2823 moved the await on the AMQ thread pool during 
> server stop to after {{callDeActiveCallbacks()}}, while the changes in 
> ARTEMIS-2838 added 
> {{server.getServer().getManagementService().unregisterHawtioSecurity()}} to 
> {{callDeActiveCallbacks()}}, moving the HawtioSecurity de-registration to 
> server stop.
> This means that on server restart, if the thread pool is slow to stop, JMX 
> won't be available again until the new start completes.
> A slow thread pool stop can happen if any long task is still running or 
> blocked in the pool: the default strategy while stopping the broker is to 
> await 10 seconds before forcing a shutdown of the pending tasks, i.e. JMX 
> can be unavailable for at least 10 seconds after {{callDeActiveCallbacks()}} 
> and before a new (pre)activation registers it again.
> This stealthy behaviour was exposed by random failures of 
> {{org.apache.activemq.artemis.tests.smoke.jmxfailback.JmxFailbackTest#testFailbackOnJMX}},
>  caused by a randomly blocked task in the thread pool.
> The core issue that causes the thread pool to block randomly has been 
> present for a long time, but the JMX unavailability window introduced by the 
> JIRAs above is the change that triggered it.
> The test checks the availability of the backup JMX connection for 5 seconds 
> during a backup restart (on failback): given that the default thread pool 
> await time is 10 seconds, any thread pool stop longer than the test window 
> makes the test fail.
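> The timing interplay can be illustrated with a hypothetical polling loop 
> (the probe, helper name, and timings below are invented for illustration 
> and are not the actual test code): a pool stop longer than the 5-second 
> window means every probe inside the window fails.

```java
import java.util.concurrent.Callable;

public class JmxProbeSketch {
    // Hypothetical polling loop in the spirit of the described test: retry a
    // connection probe until it succeeds or the window elapses.
    static boolean availableWithin(Callable<Boolean> probe, long timeoutMillis) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (probe.call()) return true;
            Thread.sleep(50); // back off briefly between attempts
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Service that "registers" again after 200 ms: probed within a 5 s
        // window, the loop succeeds.
        long becomesUpAt = System.currentTimeMillis() + 200;
        System.out.println(availableWithin(() -> System.currentTimeMillis() >= becomesUpAt, 5_000));
        // Service that stays down past the window (like JMX during a 10 s
        // pool stop probed for only 300 ms here): the loop fails.
        System.out.println(availableWithin(() -> false, 300));
    }
}
```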
> A thread dump inspection suggests that the pending task is:
> {code:java}
> jmx-failback2-out:"Thread-1 
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@1f97cf0d)"
>  Id=43 TIMED_WAITING on 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
> jmx-failback2-out:      at sun.misc.Unsafe.park(Native Method)
> jmx-failback2-out:      -  waiting on 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
> jmx-failback2-out:      at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> jmx-failback2-out:      at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:504)
> jmx-failback2-out:      -  locked java.lang.Object@607e79a2
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.core.server.cluster.ClusterControl.authorize(ClusterControl.java:80)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:271)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$38/327040562.run(Unknown
>  Source)
> jmx-failback2-out:      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> jmx-failback2-out:      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> jmx-failback2-out:
> jmx-failback2-out:      Number of locked synchronizers = 1
> jmx-failback2-out:      - 
> java.util.concurrent.ThreadPoolExecutor$Worker@6e676dc8
> {code}
> And indeed {{BackupManager::stop}}, through {{BackupConnector::close}}, just 
> calls {{closeLocator(backupServerLocator)}}, which won't unblock 
> {{clusterControl.authorize()}}.
> A first fix would be to correctly unblock any blocking call so that 
> {{BackupManager}} stops cleanly and the broker thread pool can stop 
> immediately.
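> The shape of that fix can be sketched with an invented {{Channel}} class 
> that only mimics the blocking behaviour of {{ChannelImpl.sendBlocking}} 
> (none of this is Artemis code): if {{close()}} wakes up any blocked sender, 
> the pool drains immediately instead of burning the 10-second grace period.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UnblockOnStopSketch {
    // Invented stand-in for a blocking channel call: parks until a reply
    // arrives or the channel is closed.
    static class Channel {
        private final CountDownLatch reply = new CountDownLatch(1);
        void sendBlocking() throws InterruptedException { reply.await(); }
        // The essence of the fix: closing releases any blocked sender.
        void close() { reply.countDown(); }
    }

    public static void main(String[] args) throws Exception {
        Channel channel = new Channel();
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // Stand-in for the stuck authorize() task from the thread dump.
        pool.submit(() -> {
            try { channel.sendBlocking(); } catch (InterruptedException ignored) { }
        });
        channel.close(); // unblock the pending call before stopping the pool
        pool.shutdown();
        boolean stopped = pool.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println(stopped ? "pool stopped immediately" : "pool stop timed out");
    }
}
```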



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
