[
https://issues.apache.org/jira/browse/QPIDJMS-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogalinski updated QPIDJMS-392:
------------------------------------
Attachment: logs-with-frames-2018-06-17.tar.gz
> Stalled Service Bus Subscriptions, Reconnect Loops
> --------------------------------------------------
>
> Key: QPIDJMS-392
> URL: https://issues.apache.org/jira/browse/QPIDJMS-392
> Project: Qpid JMS
> Issue Type: Bug
> Components: qpid-jms-client
> Affects Versions: 0.32.0
> Environment: * Azure App Service
> * Spring 1.5.12 via embedded web server / standalone
> * java.runtime.version "1.8.0_144-b01"
> * java.vm.version "25.144-b01"
> * java.vm.vendor "Azul Systems, Inc."
> * os.arch "amd64"
> * os.name "Windows Server 2016"
> Reporter: Paul Rogalinski
> Priority: Major
> Attachments: logs-with-frames-2018-06-17.tar.gz, qpid-log.txt
>
>
> We are experiencing connection recovery loops after an random (hours to days)
> amount of time on JMS connection subscribed to a Service Bus Topic via
> (durable) Subscription. The application does connect to two different service
> bus instances at the same time.
> We had to use the failover:(amqp:...) URL pattern in order to actually see
> something in the log files as the FailoverProvider is a bit more verbose
> (TRACE level). When using plain ampqs:// connection URL pattern the only way
> to tell a hanging/stalled subscription was to monitor the number of
> active/unconsumed Messages in a Service Bus' Subscription using a different
> client. Regardless of the Provider implementation, the problem of stalled
> subscriptions persisted.
> Log observations so far:
> After an initial disconnect:
> {code:java}
> o.a.qpid.jms.provider.amqp.AmqpProvider : New Proton Event:
> CONNECTION_REMOTE_CLOSE
> (...)
> Failover: the provider reports failure: The connection was inactive for more
> than the allowed 300000 milliseconds and is closed by container
> 'LinkTracker'. TrackingId:c2925ed17c9549b29a825cc581852d1a_G6,
> SystemTracker:gateway9, Timestamp:5/1/2018 6:11:44 PM [condition =
> amqp:connection:forced]
> {code}
> the application will enter into a reconnect/connection recovery loop with an
> interval of one minute.
> {code:java}
> 18:11:44.670Z Failover: the provider reports failure: The connection was
> inactive for more than the allowed 300000 milliseconds and is closed by
> container 'LinkTracker'. TrackingId:c2925ed17c9549b29a825cc581852d1a_G6,
> SystemTracker:gateway9, Timestamp:5/1/2018 6:11:44 PM [condition =
> amqp:connection:forced]
> 18:12:44.789Z Failover: the provider reports failure: The connection was
> inactive for more than the allowed 60000 milliseconds and is closed by
> container 'LinkTracker'. TrackingId:50b008fdb56f443b947d76ded87691a9_G5,
> SystemTracker:gateway9, Timestamp:5/1/2018 6:12:44 PM [condition =
> amqp:connection:forced]
> 18:13:44.918Z Failover: the provider reports failure: The connection was
> inactive for more than the allowed 60000 milliseconds and is closed by
> container 'LinkTracker'. TrackingId:a2832475e8ab4702b6189d9218588132_G15,
> SystemTracker:gateway9, Timestamp:5/1/2018 6:13:44 PM [condition =
> amqp:connection:forced]
> {code}
> In between we see the connection being recovered along with log messages as:
> {code:java}
> org.apache.qpid.jms.JmsConnection : Connection
> ID:8061906b-13e9-46a3-b6d0-fd227b339f67:1 is finalizing recovery
> {code}
> followed by
> {code:java}
> o.a.qpid.jms.provider.amqp.AmqpProvider : New Proton Event:
> SESSION_REMOTE_CLOSE
> {code}
> after a minute.
> A full cycle is attached in the log file, URLs are anonymized.
> In general we can see the following sequence of Proton-Events repeating:
> {code:java}
> 23:58:53.676Z CONNECTION_INIT
> 23:58:53.722Z CONNECTION_LOCAL_OPEN
> 23:58:53.722Z CONNECTION_REMOTE_OPEN
> 23:58:53.722Z SESSION_INIT
> 23:58:53.722Z SESSION_LOCAL_OPEN
> 23:58:53.738Z SESSION_REMOTE_OPEN
> 23:59:53.690Z SESSION_REMOTE_CLOSE
> 23:59:53.690Z SESSION_LOCAL_CLOSE
> 23:59:53.690Z SESSION_FINAL
> 23:59:53.706Z CONNECTION_REMOTE_CLOSE
> 23:59:53.706Z TRANSPORT_TAIL_CLOSED
> 23:59:53.706Z CONNECTION_LOCAL_CLOSE
> 23:59:53.706Z TRANSPORT_HEAD_CLOSED
> 23:59:53.706Z TRANSPORT_CLOSED
> {code}
> We did also open a ticket with Microsoft / Azure as the disconnects seemed to
> originate from the Service Bus instance (SESSION_REMOTE_CLOSE) where we did
> learn, that the Service Bus has a hardwired TTL of 5 minutes before inactive
> Subscriptions will be terminated. According to Azure-Support the issue is
> most likely related to Qpid connection handling. This does at least explain
> the first / initial timeout of 300000 seen in the logs. The connection /
> client is set up as follows
> URLs:
> {code:java}
> failover:(amqps://instance1.servicebus.windows.net)?amqp.idleTimeout=12000
> failover:(amqps://instance2.servicebus.windows.net)?amqp.idleTimeout=12000
> {code}
> Connection Factories:
> {code:java}
> JmsConnectionFactory qpidConnectionFactory = new
> JmsConnectionFactory(sbUsername, sbPassword, sbAmqpRemoteUri);
> qpidConnectionFactory.setReceiveLocalOnly(true);
> qpidConnectionFactory.setClientID(UUID.randomUUID().toString()); {code}
> JmsListenerContainerFactories:
> {code:java}
> DefaultJmsListenerContainerFactory jmsListenerContainerFactory = new
> DefaultJmsListenerContainerFactory();
> jmsListenerContainerFactory.setConnectionFactory(jmsConnectionFactory);
> jmsListenerContainerFactory.setPubSubDomain(true);
> jmsListenerContainerFactory.setSubscriptionDurable(true);
> jmsListenerContainerFactory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
> return jmsListenerContainerFactory;{code}
> And Spring-JMS Listeners:
> {code:java}
> @Component
> public class MyListener {
> @Autowired
> JMSSupport jmsSupport;
>
> @JmsListener(containerFactory = "myContainerInstance", subscription =
> "mySubscriptionName", destination = "myTopicName")
> public void onMessage(Message myMessage) throws JMSException {
> jmsSupport.consumeMessage(message -> {
> // business logic consuming the message goes here.
> }, myMessage, TextMessage.class);
> }
> }
>
> @Component
> public class JMSSupport {
> public <T extends Message> void consumeMessage(Consumer<T>
> messageConsumer, Message message, Class<T> acceptedMessageType) throws
> JMSException {
> try {
> // message validation goes here
> messageConsumer.accept((T) message);
> } catch (Exception ex) {
> message.setIntProperty(JmsMessageSupport.JMS_AMQP_ACK_TYPE,
> JmsMessageSupport.MODIFIED_FAILED);
> throw ex;
> } finally {
> message.acknowledge();
> }
> }
> }
> {code}
> In general, Qpid seems to be able to handle refreshing the IdleTimeout using
> half of the time configured which is configured in the URL.
> We could not yet find any explanation why after an random amount of time once
> of three subscriptions gets into the state illustrated above while other two
> subscriptions continue working. Please note that in between the reconnect
> loops, the re-established connection / subscription will not receive any
> pending messages.
> What is implemented differently from the most common scenarios?
> * use CLIENT_ACKNOWLEDGE mode
> What attempts have been made to work around or reproduce the issue (none
> worked)?
> * We have tried run the client in a docker container and introduce network
> packet loss (tc) - could not provoke the issue yet.
> * switch from DefaultJmsListenerContainerFactory to SimpleJmsContainerFactory
> * use a dedicated JmsListenerContainerFactory per subscription, even on the
> same server
> * use failover:(amqps://) instead of amqps://
> * recently we did learn, that it makes no difference if the client connects
> to one or two Service Bus instances, the problem does also show up in 1 JMS
> Client -> 1 Service Bus Server -> 3 Subscription scenarios. Following that
> path might be a red herring.
> * The amount of time a subscription is in an idle state does't seem to have
> any effect on the issue. Or at least there seems not to be any difference
> between idle time of few minutes vs. idle time of days. We will be monitoring
> that more closely in the future.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]