[
https://issues.apache.org/jira/browse/ARTEMIS-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363198#comment-17363198
]
ASF subversion and git services commented on ARTEMIS-3337:
----------------------------------------------------------
Commit 65782cb4578aaf85c47c2e4774203b133ebd7713 in activemq-artemis's branch
refs/heads/main from Domenico Francesco Bruscino
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=65782cb ]
ARTEMIS-3337 Add test on multiple connection failures
> CORE Client sessions are closed when there are errors during failover
> ---------------------------------------------------------------------
>
> Key: ARTEMIS-3337
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3337
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.17.0
> Environment: {noformat}
> {noformat}
> Reporter: Markus Meierhofer
> Priority: Major
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> h3. The issue
> When a CORE client does failover to a new connection, it usually transfers
> all sessions from the old to the new connection after it's able to establish
> the new connection. When the session transfer to the new connection fails, it
> assumes the new connection is also broken and does the same steps again (get
> new connection, transfer all sessions) until the reconnectAttempts have been
> reached. But after this failed session transfer, there seems to be a bug that
> sessions can get closed.
>
> One occurrence of the bug:
> A client has lost connection due to bad Wifi and tried to reconnect. After
> reconnect, the sessions were transferred, but one of the sessions couldn't be
> transferred (Recognizable by the "failed to handle failover" error):
> {noformat}
> [ERROR 2021-05-27 00:24:08,175 l-threads)
> org.apache.activemq.artemis.core.client]: AMQ214003: Failed to handle failover
> org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException:
> AMQ219014: Timed out after waiting 4,000 ms for response when sending packet
> 32
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.reattachOnNewConnection(ActiveMQSessionContext.java:836)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.handleFailover(ClientSessionImpl.java:1393)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.reconnectSessions(ClientSessionFactoryImpl.java:806)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:638)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
> at
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
> at
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
> at
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
> at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
> at
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118){noformat}
> Afterwards, although the failover was still ongoing, a producer tried to send
> a message, but got the error "Connection is destroyed". This error is only
> printed if a session's underlying ChannelImpl is closed:
> {noformat}
> [ERROR 2021-05-27 00:24:15,526 l-threads)
> n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call
> reply! (Request queue: robot_2001.robot_command, Reply queue:
> ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request:
> RobotCommandRequest(info=SlotCommandInfo(super=CommandInfo(source=SYSTEM,
> id=1486607), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true
> LhdOptions( z=0.6) timeout=1800),
> transferOptions=TransferOptions(SECONDARY_SLOT loadUnits=[LoadUnit(code=,
> orderId=1486594, verify=false, empty=false, size=1)]), goalName=LMC_H1),
> type=unload_at, timestamp=2021-05-26T22:24:03.450405Z),Reply to send:
> RobotCommandResponse(id=1486607, result=REQUEST_TOO_OLD)):
> [ERROR 2021-05-27 00:24:15,526 l-threads)
> n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
> at
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
> at
> com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
> at
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
> at
> org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
> at
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
> at
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
> at
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
> at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
> at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
> at
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by:
> org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException:
> AMQ219010: Connection is destroyed
> ... 21 common frames omitted{noformat}
> ~12 seconds later, the connection "successfully" failovered:
> {noformat}
> [ERROR 2021-05-27 00:24:27,948 ad-6433287
> nt.control.RestartAgentExceptionListener]: JMS exception listener retrieved
> exception!
> javax.jms.JMSException:
> ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT
> message=AMQ219014: Timed out after waiting 4,000 ms for response when sending
> packet 45]
> at
> org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:750)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:771)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:736)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:724)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:680)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
> at
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
> at
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
> at
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
> at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
> at
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by:
> org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException:
> AMQ219014: Timed out after waiting 4,000 ms for response when sending packet
> 45
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.queueQuery(ActiveMQSessionContext.java:359)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.queueQuery(ClientSessionImpl.java:780)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:409)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
> at
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
> at
> com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
> at
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
> at
> org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
> at
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
> at
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
> at
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
> ... 6 common frames omitted
> [ERROR 2021-05-27 00:24:27,948 ad-6433287
> nt.control.RestartAgentExceptionListener]: JMS exception error code is:
> FAILOVER{noformat}
> But even after the failover, every time a message was tried to be sent on
> that session, it failed with "Connection is destroyed". Other sessions worked
> fine after the failover:
> {noformat}
> [ERROR 2021-05-27 00:24:28,511 l-threads)
> n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call
> reply! (Request queue: robot_2001.robot_command, Reply queue:
> ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request:
> RobotCommandRequest(info=GotoCommandInfo(super=CommandInfo(source=SYSTEM,
> id=1487919), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true
> LhdOptions( z=0.797) timeout=1800), goalName=WH_Conveyor_Box_Put), type=goto,
> timestamp=2021-05-26T22:24:28.508373Z),Reply to send:
> RobotCommandResponse(id=1487919, result=SUCCESS)):
> [ERROR 2021-05-27 00:24:28,511 l-threads)
> n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
> at
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
> at
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
> at
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
> at
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
> at
> com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
> at
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
> at
> org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
> at
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
> at
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
> at
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
> at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
> at
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
> at
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by:
> org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException:
> AMQ219010: Connection is destroyed
> ... 21 common frames omitted
> {noformat}
> h3. Cause of the issue
> After reconnect, every session's "handleFailover" method is called to
> transfer the session to the new connection. But if handleFailover fails for
> one session, the reconnect immediately returns and tries again:
>
> {noformat}
> for (ClientSessionInternal session : sessionsToFailover) {
> if (!session.handleFailover(connection, cause)) {
> return false;
> }
> }{noformat}
>
> The issue here is that "handleFailover" is not executed for every session:
> * During handleFailover, "transferConnection" is called on the session's
> ChannelImpl, which removes the session's channel from the old connection and
> puts it onto the new one. If transferConnection is not executed for a
> session, this session's channel stays on the old connection.
> * During connection destroy, every channel set on that connection is closed
> via Channel.close
> * Now if one of the sessions wouldn't successfully execute handleFailover,
> the method would immediately return and not remove the remaining session's
> channels from the old connection. And as before reconnecting, the old
> connection (which still contains some session channels) is destroyed, those
> session's channels are also closed.
> Even if handleFailover is afterwards successfully executed on those sessions,
> the "closed" flag is already set on their channels, so they will reject
> future sends.
> h3. Possible solution
> It needs to be ensured that all session's channels are transferred (=removed
> from the old connection) before the old connection is destroyed, even if an
> earlier session failed to be transferred on the server-side.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)