[ 
https://issues.apache.org/jira/browse/QPIDJMS-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogalinski updated QPIDJMS-392:
------------------------------------
    Attachment: logs-with-frames-2018-06-17.tar.gz

> Stalled Service Bus Subscriptions, Reconnect Loops
> --------------------------------------------------
>
>                 Key: QPIDJMS-392
>                 URL: https://issues.apache.org/jira/browse/QPIDJMS-392
>             Project: Qpid JMS
>          Issue Type: Bug
>          Components: qpid-jms-client
>    Affects Versions: 0.32.0
>         Environment: * Azure App Service
>  * Spring 1.5.12 via embedded web server / standalone
>  * java.runtime.version "1.8.0_144-b01"
>  * java.vm.version "25.144-b01"
>  * java.vm.vendor "Azul Systems, Inc."
>  * os.arch "amd64"
>  * os.name "Windows Server 2016"
>            Reporter: Paul Rogalinski
>            Priority: Major
>         Attachments: logs-with-frames-2018-06-17.tar.gz, qpid-log.txt
>
>
> We are experiencing connection recovery loops after an random (hours to days) 
> amount of time on JMS connection subscribed to a Service Bus Topic via 
> (durable) Subscription. The application does connect to two different service 
> bus instances at the same time.
> We had to use the failover:(amqp:...) URL pattern in order to actually see 
> something in the log files as the FailoverProvider is a bit more verbose 
> (TRACE level). When using plain ampqs:// connection URL pattern the only way 
> to tell a hanging/stalled subscription was to monitor the number of 
> active/unconsumed Messages in a Service Bus' Subscription using a different 
> client. Regardless of the Provider implementation, the problem of stalled 
> subscriptions persisted.
> Log observations so far:
> After an initial disconnect:
> {code:java}
> o.a.qpid.jms.provider.amqp.AmqpProvider  : New Proton Event: 
> CONNECTION_REMOTE_CLOSE
> (...)
> Failover: the provider reports failure: The connection was inactive for more 
> than the allowed 300000 milliseconds and is closed by container 
> 'LinkTracker'. TrackingId:c2925ed17c9549b29a825cc581852d1a_G6, 
> SystemTracker:gateway9, Timestamp:5/1/2018 6:11:44 PM [condition = 
> amqp:connection:forced]
> {code}
> the application will enter into a reconnect/connection recovery loop with an 
> interval of one minute.
> {code:java}
> 18:11:44.670Z Failover: the provider reports failure: The connection was 
> inactive for more than the allowed 300000 milliseconds and is closed by 
> container 'LinkTracker'. TrackingId:c2925ed17c9549b29a825cc581852d1a_G6, 
> SystemTracker:gateway9, Timestamp:5/1/2018 6:11:44 PM [condition = 
> amqp:connection:forced]
> 18:12:44.789Z Failover: the provider reports failure: The connection was 
> inactive for more than the allowed 60000 milliseconds and is closed by 
> container 'LinkTracker'. TrackingId:50b008fdb56f443b947d76ded87691a9_G5, 
> SystemTracker:gateway9, Timestamp:5/1/2018 6:12:44 PM [condition = 
> amqp:connection:forced]
> 18:13:44.918Z Failover: the provider reports failure: The connection was 
> inactive for more than the allowed 60000 milliseconds and is closed by 
> container 'LinkTracker'. TrackingId:a2832475e8ab4702b6189d9218588132_G15, 
> SystemTracker:gateway9, Timestamp:5/1/2018 6:13:44 PM [condition = 
> amqp:connection:forced]
> {code}
> In between we see the connection being recovered along with log messages as:
> {code:java}
> org.apache.qpid.jms.JmsConnection        : Connection 
> ID:8061906b-13e9-46a3-b6d0-fd227b339f67:1 is finalizing recovery
> {code}
> followed by
> {code:java}
> o.a.qpid.jms.provider.amqp.AmqpProvider  : New Proton Event: 
> SESSION_REMOTE_CLOSE
> {code}
> after a minute.
> A full cycle is attached in the log file, URLs are anonymized.
> In general we can see the following sequence of Proton-Events repeating:
> {code:java}
> 23:58:53.676Z CONNECTION_INIT
> 23:58:53.722Z CONNECTION_LOCAL_OPEN
> 23:58:53.722Z CONNECTION_REMOTE_OPEN
> 23:58:53.722Z SESSION_INIT
> 23:58:53.722Z SESSION_LOCAL_OPEN
> 23:58:53.738Z SESSION_REMOTE_OPEN
> 23:59:53.690Z SESSION_REMOTE_CLOSE
> 23:59:53.690Z SESSION_LOCAL_CLOSE
> 23:59:53.690Z SESSION_FINAL
> 23:59:53.706Z CONNECTION_REMOTE_CLOSE
> 23:59:53.706Z TRANSPORT_TAIL_CLOSED
> 23:59:53.706Z CONNECTION_LOCAL_CLOSE
> 23:59:53.706Z TRANSPORT_HEAD_CLOSED
> 23:59:53.706Z TRANSPORT_CLOSED
> {code}
> We did also open a ticket with Microsoft / Azure as the disconnects seemed to 
> originate from the Service Bus instance (SESSION_REMOTE_CLOSE) where we did 
> learn, that the Service Bus has a hardwired TTL of 5 minutes before inactive 
> Subscriptions will be terminated. According to Azure-Support the issue is 
> most likely related to Qpid connection handling. This does at least explain 
> the first / initial timeout of 300000 seen in the logs. The connection / 
> client is set up as follows
> URLs:
> {code:java}
> failover:(amqps://instance1.servicebus.windows.net)?amqp.idleTimeout=12000
> failover:(amqps://instance2.servicebus.windows.net)?amqp.idleTimeout=12000
> {code}
> Connection Factories:
> {code:java}
> JmsConnectionFactory qpidConnectionFactory = new 
> JmsConnectionFactory(sbUsername, sbPassword, sbAmqpRemoteUri);
>  qpidConnectionFactory.setReceiveLocalOnly(true);
> qpidConnectionFactory.setClientID(UUID.randomUUID().toString()); {code}
> JmsListenerContainerFactories:
> {code:java}
> DefaultJmsListenerContainerFactory jmsListenerContainerFactory = new 
> DefaultJmsListenerContainerFactory();
> jmsListenerContainerFactory.setConnectionFactory(jmsConnectionFactory);
> jmsListenerContainerFactory.setPubSubDomain(true);
> jmsListenerContainerFactory.setSubscriptionDurable(true);
> jmsListenerContainerFactory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
> return jmsListenerContainerFactory;{code}
> And Spring-JMS Listeners:
> {code:java}
> @Component
> public class MyListener {
>     @Autowired
>     JMSSupport jmsSupport;
>  
>     @JmsListener(containerFactory = "myContainerInstance", subscription = 
> "mySubscriptionName", destination = "myTopicName")
>     public void onMessage(Message myMessage) throws JMSException {
>         jmsSupport.consumeMessage(message -> {
>             // business logic consuming the message goes here.
>         }, myMessage, TextMessage.class);
>     }
> }
>  
> @Component
> public class JMSSupport {
>     public <T extends Message> void consumeMessage(Consumer<T> 
> messageConsumer, Message message, Class<T> acceptedMessageType) throws 
> JMSException {
>         try {
>             // message validation goes here
>             messageConsumer.accept((T) message);
>         } catch (Exception ex) {
>             message.setIntProperty(JmsMessageSupport.JMS_AMQP_ACK_TYPE, 
> JmsMessageSupport.MODIFIED_FAILED);
>             throw ex;
>         } finally {
>             message.acknowledge();
>         }
>     }
> }
> {code}
> In general, Qpid seems to be able to handle refreshing the IdleTimeout using 
> half of the time configured which is configured in the URL.
> We could not yet find any explanation why after an random amount of time once 
> of three subscriptions gets into the state illustrated above while other two 
> subscriptions continue working. Please note that in between the reconnect 
> loops, the re-established connection / subscription will not receive any 
> pending messages. 
> What is implemented differently from the most common scenarios?
>  * use CLIENT_ACKNOWLEDGE mode
> What attempts have been made to work around or reproduce the issue (none 
> worked)?
>  * We have tried run the client in a docker container and introduce network 
> packet loss (tc) - could not provoke the issue yet.
>  * switch from DefaultJmsListenerContainerFactory to SimpleJmsContainerFactory
>  * use a dedicated JmsListenerContainerFactory per subscription, even on the 
> same server
>  * use failover:(amqps://) instead of amqps://
>  * recently we did learn, that it makes no difference if the client connects 
> to one or two Service Bus instances, the problem does also show up in 1 JMS 
> Client -> 1 Service Bus Server -> 3 Subscription scenarios. Following that 
> path might be a red herring.
>  * The amount of time a subscription is in an idle state does't seem to have 
> any effect on the issue. Or at least there seems not to be any difference 
> between idle time of few minutes vs. idle time of days. We will be monitoring 
> that more closely in the future.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to