[ 
https://issues.apache.org/jira/browse/ARTEMIS-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363198#comment-17363198
 ] 

ASF subversion and git services commented on ARTEMIS-3337:
----------------------------------------------------------

Commit 65782cb4578aaf85c47c2e4774203b133ebd7713 in activemq-artemis's branch 
refs/heads/main from Domenico Francesco Bruscino
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=65782cb ]

ARTEMIS-3337 Add test on multiple connection failures


> CORE Client sessions are closed when there are errors during failover
> ---------------------------------------------------------------------
>
>                 Key: ARTEMIS-3337
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3337
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.17.0
>         Environment: {noformat}
>  {noformat}
>            Reporter: Markus Meierhofer
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> h3. The issue
> When a CORE client does failover to a new connection, it usually transfers 
> all sessions from the old to the new connection after it's able to establish 
> the new connection. When the session transfer to the new connection fails, it 
> assumes the new connection is also broken and does the same steps again (get 
> new connection, transfer all sessions) until the reconnectAttempts have been 
> reached. But after this failed session transfer, there seems to be a bug that 
> sessions can get closed.
>  
> One occurrence of the bug:
> A client has lost connection due to bad Wifi and tried to reconnect. After 
> reconnect, the sessions were transferred, but one of the sessions couldn't be 
> transferred (Recognizable by the "failed to handle failover" error):
> {noformat}
> [ERROR 2021-05-27 00:24:08,175 l-threads)  
> org.apache.activemq.artemis.core.client]: AMQ214003: Failed to handle failover
> org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: 
> AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 
> 32
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.reattachOnNewConnection(ActiveMQSessionContext.java:836)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.handleFailover(ClientSessionImpl.java:1393)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.reconnectSessions(ClientSessionFactoryImpl.java:806)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:638)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118){noformat}
> Afterwards, although the failover was still ongoing, a producer tried to send 
> a message, but got the error "Connection is destroyed". This error is only 
> printed if a session's underlying ChannelImpl is closed:
> {noformat}
> [ERROR 2021-05-27 00:24:15,526 l-threads) 
> n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call 
> reply! (Request queue: robot_2001.robot_command, Reply queue: 
> ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: 
> RobotCommandRequest(info=SlotCommandInfo(super=CommandInfo(source=SYSTEM, 
> id=1486607), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true 
> LhdOptions( z=0.6) timeout=1800), 
> transferOptions=TransferOptions(SECONDARY_SLOT loadUnits=[LoadUnit(code=, 
> orderId=1486594, verify=false, empty=false, size=1)]), goalName=LMC_H1), 
> type=unload_at, timestamp=2021-05-26T22:24:03.450405Z),Reply to send: 
> RobotCommandResponse(id=1486607, result=REQUEST_TOO_OLD)):
> [ERROR 2021-05-27 00:24:15,526 l-threads) 
> n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at 
> com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at 
> org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: 
> org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: 
> AMQ219010: Connection is destroyed
>         ... 21 common frames omitted{noformat}
> ~12 seconds later, the connection "successfully" failovered:
> {noformat}
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 
> nt.control.RestartAgentExceptionListener]: JMS exception listener retrieved 
> exception!
> javax.jms.JMSException: 
> ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT 
> message=AMQ219014: Timed out after waiting 4,000 ms for response when sending 
> packet 45]
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:750)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:771)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:736)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:724)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:680)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: 
> org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: 
> AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 
> 45
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.queueQuery(ActiveMQSessionContext.java:359)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.queueQuery(ClientSessionImpl.java:780)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:409)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at 
> com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at 
> org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         ... 6 common frames omitted
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 
> nt.control.RestartAgentExceptionListener]: JMS exception error code is: 
> FAILOVER{noformat}
> But even after the failover, every time a message was tried to be sent on 
> that session, it failed with "Connection is destroyed". Other sessions worked 
> fine after the failover:
> {noformat}
> [ERROR 2021-05-27 00:24:28,511 l-threads) 
> n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call 
> reply! (Request queue: robot_2001.robot_command, Reply queue: 
> ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: 
> RobotCommandRequest(info=GotoCommandInfo(super=CommandInfo(source=SYSTEM, 
> id=1487919), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true 
> LhdOptions( z=0.797) timeout=1800), goalName=WH_Conveyor_Box_Put), type=goto, 
> timestamp=2021-05-26T22:24:28.508373Z),Reply to send: 
> RobotCommandResponse(id=1487919, result=SUCCESS)):
> [ERROR 2021-05-27 00:24:28,511 l-threads) 
> n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at 
> com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at 
> org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: 
> org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: 
> AMQ219010: Connection is destroyed
>         ... 21 common frames omitted
> {noformat}
> h3. Cause of the issue
> After reconnect, every session's "handleFailover" method is called to 
> transfer the session to the new connection. But if handleFailover fails for 
> one session, the reconnect immediately returns and tries again:
>  
> {noformat}
> for (ClientSessionInternal session : sessionsToFailover) {
>          if (!session.handleFailover(connection, cause)) {
>             return false;
>          }
> }{noformat}
>  
> The issue here is that "handleFailover" is not executed for every session:
>  * During handleFailover, "transferConnection" is called on the session's 
> ChannelImpl, which removes the session's channel from the old connection and 
> puts it onto the new one. If transferConnection is not executed for a 
> session, this session's channel stays on the old connection.
>  * During connection destroy, every channel set on that connection is closed 
> via Channel.close
>  * Now if one of the sessions wouldn't successfully execute handleFailover, 
> the method would immediately return and not remove the remaining session's 
> channels from the old connection. And as before reconnecting, the old 
> connection (which still contains some session channels) is destroyed, those 
> session's channels are also closed.
> Even if handleFailover is afterwards successfully executed on those sessions, 
> the "closed" flag is already set on their channels, so they will reject 
> future sends.
> h3. Possible solution
> It needs to be ensured that all session's channels are transferred (=removed 
> from the old connection) before the old connection is destroyed, even if an 
> earlier session failed to be transferred on the server-side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to