[ https://issues.apache.org/jira/browse/ARTEMIS-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Francesco Nigro updated ARTEMIS-2958:
-------------------------------------
Description:
The changes in ARTEMIS-2823 moved the AMQ thread pool stop (on server stop) to after {{callDeActiveCallbacks()}}, while the changes in ARTEMIS-2838 added {{server.getServer().getManagementService().unregisterHawtioSecurity()}} to {{callDeActiveCallbacks()}}, moving the HawtioSecurity de-registration to server stop. This means that on a server restart, if the thread pool stops slowly, JMX won't be available until the next start.

The thread pool stop can block the broker stop if a long task is still running/blocked in the pool: the default strategy while stopping the broker is to wait 10 seconds before forcing a shutdown of the pending tasks, i.e. JMX can be unavailable for at least 10 seconds after {{callDeActiveCallbacks()}} and before a new (pre)activation registers it again.

This stealthy behaviour has been captured by random failures of {{org.apache.activemq.artemis.tests.smoke.jmxfailback.JmxFailbackTest#testFailbackOnJMX}}, due to a randomly blocked task in the thread pool. The core issue that causes the thread pool to become randomly blocked had been present for a long time, but the JMX unavailability window introduced by the JIRAs mentioned above is the change that triggered the bomb. The test checks the availability of the backup JMX connection for 5 seconds during a backup restart (on failback): given that the default thread pool await time is 10 seconds, a slow thread pool stop makes the test fail.
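The stop strategy described above can be sketched as follows. This is a minimal, self-contained illustration (names like {{stopPool}} are hypothetical, not the Artemis code): a pool shutdown first waits a grace period for pending tasks, then forces a shutdown, so a single blocked task makes the stop last at least the whole grace period.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownWindow {

    // Hypothetical helper sketching the described stop strategy:
    // await a grace period, then force-shutdown the pending tasks.
    static long stopPool(ExecutorService pool, long graceMillis) throws InterruptedException {
        long start = System.nanoTime();
        pool.shutdown();                                            // stop accepting new tasks
        if (!pool.awaitTermination(graceMillis, TimeUnit.MILLISECONDS)) {
            pool.shutdownNow();                                     // interrupt blocked tasks
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for a task parked the way clusterControl.authorize() was:
        // it blocks until interrupted by shutdownNow().
        pool.submit(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        // 1 s grace here instead of the broker's 10 s, to keep the demo fast.
        long elapsed = stopPool(pool, 1_000);
        // The stop lasts at least the full grace period because of the blocked task.
        System.out.println(elapsed >= 900);
    }
}
```

With the broker's 10-second grace period, the same mechanism is what keeps JMX unregistered for up to 10 seconds between de-activation and the next (pre)activation.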
From a thread dump inspection, the pending task seems to be:
{code:java}
jmx-failback2-out:"Thread-1 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@1f97cf0d)" Id=43 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
jmx-failback2-out:    at sun.misc.Unsafe.park(Native Method)
jmx-failback2-out:    -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
jmx-failback2-out:    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
jmx-failback2-out:    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
jmx-failback2-out:    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:504)
jmx-failback2-out:    -  locked java.lang.Object@607e79a2
jmx-failback2-out:    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
jmx-failback2-out:    at org.apache.activemq.artemis.core.server.cluster.ClusterControl.authorize(ClusterControl.java:80)
jmx-failback2-out:    at org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:271)
jmx-failback2-out:    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
jmx-failback2-out:    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
jmx-failback2-out:    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
jmx-failback2-out:    at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$38/327040562.run(Unknown Source)
jmx-failback2-out:    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
jmx-failback2-out:    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
jmx-failback2-out:    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
jmx-failback2-out:
jmx-failback2-out:    Number of locked synchronizers = 1
jmx-failback2-out:    - java.util.concurrent.ThreadPoolExecutor$Worker@6e676dc8
{code}
And indeed it seems that {{BackupManager::stop}}, which calls {{BackupConnector::close}}, only does {{closeLocator(backupServerLocator)}}, which won't unblock {{clusterControl.authorize()}}. A first fix would be to correctly unblock any blocking call, so that {{BackupManager}} stops cleanly and the broker thread pool can stop immediately.
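The fix idea above can be sketched generically. This is a hypothetical model, not the actual Artemis code: a connector whose blocking call (standing in for the {{clusterControl.authorize()}} round-trip) is released by {{close()}}, so a stopping {{BackupManager}} never leaves a task parked in the shared thread pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: close() wakes any thread parked in the blocking
// call, instead of only closing the locator as BackupConnector::close did.
public class AbortableConnector {

    private final CountDownLatch closed = new CountDownLatch(1);

    // Stands in for the blocking authorize() round-trip: waits for a reply
    // or for close(), whichever comes first. In the real code this would be
    // the channel's reply condition; here we only model "parked until
    // closed or timed out". Returns true if the wait timed out.
    public boolean authorize(long timeoutMillis) throws InterruptedException {
        boolean releasedByClose = closed.await(timeoutMillis, TimeUnit.MILLISECONDS);
        return !releasedByClose;
    }

    // Unblocks every pending authorize() call.
    public void close() {
        closed.countDown();
    }

    public static void main(String[] args) throws Exception {
        AbortableConnector connector = new AbortableConnector();
        Thread stopper = new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            connector.close(); // simulates BackupManager::stop closing the connector
        });
        stopper.start();
        long start = System.nanoTime();
        connector.authorize(10_000); // would otherwise park for the full 10 s
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        stopper.join();
        // Unblocked promptly by close(), well before the timeout.
        System.out.println(elapsedMillis < 5_000);
    }
}
```

With this shape, the broker's pool stop no longer has to wait out its 10-second grace period for a task parked in a blocking cluster call.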
> Timed out waiting pool stop on backup restart
> ---------------------------------------------
>
>                 Key: ARTEMIS-2958
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2958
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker, JMX
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Minor
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)