[jira] [Closed] (ARTEMIS-2958) Timed out waiting pool stop on backup restart

Clebert Suconic (Jira) Mon, 02 Nov 2020 17:16:50 -0800


     [ 
https://issues.apache.org/jira/browse/ARTEMIS-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Clebert Suconic closed ARTEMIS-2958.
------------------------------------
    Fix Version/s: 2.16.0
       Resolution: Fixed

> Timed out waiting pool stop on backup restart
> ---------------------------------------------
>
>                 Key: ARTEMIS-2958
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2958
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker, JMX
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Minor
>             Fix For: 2.16.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The changes on ARTEMIS-2823 have caused the AMQ thread pool stop on server 
> stop to be moved after {{callDeActiveCallbacks()}}, while the changes on 
> ARTEMIS-2838 to 
> {{server.getServer().getManagementService().unregisterHawtioSecurity()}} on 
> {{callDeActiveCallbacks()}} have moved the HawtioSecurity de-registration to 
> happen on server stop.
> it means that on server restart, if the thread pool has a slow stop, JMX 
> won't be available until a new start.
> The thread pool stop can block stopping if there is a long task still 
> running/blocked in the pool and the default strategy while stopping the 
> broker is to await 10 seconds before forcing a shutdown of the pending tasks 
> ie JMX can be unavailable for at least 10 seconds after 
> {{callDeActiveCallbacks()}} and before a new (pre)activation will register it 
> again.
> This stealthy behaviour has been captured by random failures on 
> {{org.apache.activemq.artemis.tests.smoke.jmxfailback.JmxFailbackTest#testFailbackOnJMX}},
>  due to some randomly blocked task in the thread pool.
> The core issue that cause the thread pool to be randomly blocked was present 
> by long time, but the unavailability time window of JMX introduced by the 
> mentioned JIRAs was the change that has triggered the bomb.
> The test check for 5 seconds the availability of backup JMX connection during 
> a backup restart (on failback): given that the default thread pool await time 
> is 10 seconds, a longer thread pool stop will make the test to fail.
> It seems by a thread dump inspection that the pending task is:
> {code:java}
> jmx-failback2-out:"Thread-1 
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@1f97cf0d)"
>  Id=43 TIMED_WAITING on 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
> jmx-failback2-out:      at sun.misc.Unsafe.park(Native Method)
> jmx-failback2-out:      -  waiting on 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
> jmx-failback2-out:      at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> jmx-failback2-out:      at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:504)
> jmx-failback2-out:      -  locked java.lang.Object@607e79a2
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.core.server.cluster.ClusterControl.authorize(ClusterControl.java:80)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:271)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$38/327040562.run(Unknown
>  Source)
> jmx-failback2-out:      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> jmx-failback2-out:      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> jmx-failback2-out:      at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> jmx-failback2-out:
> jmx-failback2-out:      Number of locked synchronizers = 1
> jmx-failback2-out:      - 
> java.util.concurrent.ThreadPoolExecutor$Worker@6e676dc8
> {code}
> And indeed it seems that both {{BackupManager::stop}} and 
> {{BackupManager::activated}} are calling {{BackupConnector::close}} that's 
> calling {{closeLocator(backupServerLocator)}} without unblocking 
> {{clusterControl.authorize()}}.
> A possible fix would be to correctly unblock any blocking call on both cases, 
> to clean stop {{BackupManager}} and let the broker thread pool to immediately 
> stop. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Closed] (ARTEMIS-2958) Timed out waiting pool stop on backup restart

Reply via email to