Markus Meierhofer created ARTEMIS-3337:
------------------------------------------
Summary: CORE Client sessions are closed when there are errors
during failover
Key: ARTEMIS-3337
URL: https://issues.apache.org/jira/browse/ARTEMIS-3337
Project: ActiveMQ Artemis
Issue Type: Bug
Affects Versions: 2.17.0
Environment: {noformat}
{noformat}
Reporter: Markus Meierhofer
h3. The issue
When a CORE client does failover to a new connection, it usually transfers all
sessions from the old to the new connection after it's able to establish the
new connection. When the session transfer to the new connection fails, it
assumes the new connection is also broken and does the same steps again (get
new connection, transfer all sessions) until the reconnectAttempts have been
reached. But after this failed session transfer, there seems to be a bug that
sessions can get closed.
One occurrence of the bug:
A client has lost connection due to bad Wifi and tried to reconnect. After
reconnect, the sessions were transferred, but one of the sessions couldn't be
transferred (Recognizable by the "failed to handle failover" error):
{noformat}
[ERROR 2021-05-27 00:24:08,175 l-threads)
org.apache.activemq.artemis.core.client]: AMQ214003: Failed to handle failover
org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException:
AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 32
at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
at
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.reattachOnNewConnection(ActiveMQSessionContext.java:836)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.handleFailover(ClientSessionImpl.java:1393)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.reconnectSessions(ClientSessionFactoryImpl.java:806)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:638)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
at
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
at
org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
at
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
at
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118){noformat}
Afterwards, although the failover was still ongoing, a producer tried to send a
message, but got the error "Connection is destroyed". This error is only
printed if a session's underlying ChannelImpl is closed:
{noformat}
[ERROR 2021-05-27 00:24:15,526 l-threads)
n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call
reply! (Request queue: robot_2001.robot_command, Reply queue:
ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request:
RobotCommandRequest(info=SlotCommandInfo(super=CommandInfo(source=SYSTEM,
id=1486607), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true
LhdOptions( z=0.6) timeout=1800),
transferOptions=TransferOptions(SECONDARY_SLOT loadUnits=[LoadUnit(code=,
orderId=1486594, verify=false, empty=false, size=1)]), goalName=LMC_H1),
type=unload_at, timestamp=2021-05-26T22:24:03.450405Z),Reply to send:
RobotCommandResponse(id=1486607, result=REQUEST_TOO_OLD)):
[ERROR 2021-05-27 00:24:15,526 l-threads)
n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
javax.jms.JMSException: AMQ219010: Connection is destroyed
at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
at
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
at
org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
at
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
at
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
at
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
at
com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
at
com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
at
com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
at
org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
at
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
at
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
at
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException:
AMQ219010: Connection is destroyed
... 21 common frames omitted{noformat}
~12 seconds later, the connection "successfully" failovered:
{noformat}
[ERROR 2021-05-27 00:24:27,948 ad-6433287
nt.control.RestartAgentExceptionListener]: JMS exception listener retrieved
exception!
javax.jms.JMSException:
ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT
message=AMQ219014: Timed out after waiting 4,000 ms for response when sending
packet 45]
at
org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:750)
at
org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:771)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:736)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:724)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:680)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
at
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
at
org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
at
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
at
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Caused by:
org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException:
AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 45
at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
at
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.queueQuery(ActiveMQSessionContext.java:359)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.queueQuery(ClientSessionImpl.java:780)
at
org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:409)
at
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
at
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
at
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
at
com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
at
com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
at
com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
at
org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
at
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
at
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
at
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
... 6 common frames omitted
[ERROR 2021-05-27 00:24:27,948 ad-6433287
nt.control.RestartAgentExceptionListener]: JMS exception error code is:
FAILOVER{noformat}
But even after the failover, every time a message was tried to be sent on that
session, it failed with "Connection is destroyed". Other sessions worked fine
after the failover:
{noformat}
[ERROR 2021-05-27 00:24:28,511 l-threads)
n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call
reply! (Request queue: robot_2001.robot_command, Reply queue:
ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request:
RobotCommandRequest(info=GotoCommandInfo(super=CommandInfo(source=SYSTEM,
id=1487919), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true
LhdOptions( z=0.797) timeout=1800), goalName=WH_Conveyor_Box_Put), type=goto,
timestamp=2021-05-26T22:24:28.508373Z),Reply to send:
RobotCommandResponse(id=1487919, result=SUCCESS)):
[ERROR 2021-05-27 00:24:28,511 l-threads)
n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
javax.jms.JMSException: AMQ219010: Connection is destroyed
at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
at
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
at
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
at
org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
at
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
at
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
at
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
at
com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
at
com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
at
com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
at
org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
at
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
at
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
at
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException:
AMQ219010: Connection is destroyed
... 21 common frames omitted
{noformat}
h3. Cause of the issue
After reconnect, every session's "handleFailover" method is called to transfer
the session to the new connection. But if handleFailover fails for one session,
the reconnect immediately returns and tries again:
{noformat}
for (ClientSessionInternal session : sessionsToFailover) {
if (!session.handleFailover(connection, cause)) {
return false;
}
}{noformat}
The issue here is that "handleFailover" is not executed for every session:
* During handleFailover, "transferConnection" is called on the session's
ChannelImpl, which removes the session's channel from the old connection and
puts it onto the new one. If transferConnection is not executed for a session,
this session's channel stays on the old connection.
* During connection destroy, every channel set on that connection is closed
via Channel.close
* Now if one of the sessions wouldn't successfully execute handleFailover, the
method would immediately return and not remove the remaining session's channels
from the old connection. And as before reconnecting, the old connection (which
still contains some session channels) is destroyed, those session's channels
are also closed.
Even if handleFailover is afterwards successfully executed on those sessions,
the "closed" flag is already set on their channels, so they will reject future
sends.
h3. Possible solution
It needs to be ensured that all session's channels are transferred (=removed
from the old connection) before the old connection is destroyed, even if an
earlier session failed to be transferred on the server-side.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)