[
https://issues.apache.org/jira/browse/QPIDJMS-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogalinski updated QPIDJMS-392:
------------------------------------
Description:
We are experiencing connection recovery loops after a random amount of time (hours to days) on JMS connections subscribed to a Service Bus topic via a (durable) subscription. The application connects to two different Service Bus instances at the same time.
We had to use the failover:(amqp:...) URL pattern in order to actually see anything in the log files, as the FailoverProvider is more verbose (at TRACE level). When using the plain amqps:// connection URL pattern, the only way to tell that a subscription was hanging/stalled was to monitor the number of active/unconsumed messages in the Service Bus subscription using a different client. Regardless of the provider implementation, the problem of stalled subscriptions persisted.
Log observations so far:
After an initial disconnect:
{code:java}
o.a.qpid.jms.provider.amqp.AmqpProvider : New Proton Event:
CONNECTION_REMOTE_CLOSE
(...)
Failover: the provider reports failure: The connection was inactive for more
than the allowed 300000 milliseconds and is closed by container 'LinkTracker'.
TrackingId:c2925ed17c9549b29a825cc581852d1a_G6, SystemTracker:gateway9,
Timestamp:5/1/2018 6:11:44 PM [condition = amqp:connection:forced]
{code}
the application enters a reconnect/connection recovery loop with an interval of one minute.
{code:java}
18:11:44.670Z Failover: the provider reports failure: The connection was
inactive for more than the allowed 300000 milliseconds and is closed by
container 'LinkTracker'. TrackingId:c2925ed17c9549b29a825cc581852d1a_G6,
SystemTracker:gateway9, Timestamp:5/1/2018 6:11:44 PM [condition =
amqp:connection:forced]
18:12:44.789Z Failover: the provider reports failure: The connection was
inactive for more than the allowed 60000 milliseconds and is closed by
container 'LinkTracker'. TrackingId:50b008fdb56f443b947d76ded87691a9_G5,
SystemTracker:gateway9, Timestamp:5/1/2018 6:12:44 PM [condition =
amqp:connection:forced]
18:13:44.918Z Failover: the provider reports failure: The connection was
inactive for more than the allowed 60000 milliseconds and is closed by
container 'LinkTracker'. TrackingId:a2832475e8ab4702b6189d9218588132_G15,
SystemTracker:gateway9, Timestamp:5/1/2018 6:13:44 PM [condition =
amqp:connection:forced]
{code}
In between we see the connection being recovered, along with log messages such as:
{code:java}
org.apache.qpid.jms.JmsConnection : Connection
ID:8061906b-13e9-46a3-b6d0-fd227b339f67:1 is finalizing recovery
{code}
followed by
{code:java}
o.a.qpid.jms.provider.amqp.AmqpProvider : New Proton Event:
SESSION_REMOTE_CLOSE
{code}
after a minute.
A full cycle is attached in the log file; URLs are anonymized.
In general we can see the following sequence of Proton events repeating:
{code:java}
23:58:53.676Z CONNECTION_INIT
23:58:53.722Z CONNECTION_LOCAL_OPEN
23:58:53.722Z CONNECTION_REMOTE_OPEN
23:58:53.722Z SESSION_INIT
23:58:53.722Z SESSION_LOCAL_OPEN
23:58:53.738Z SESSION_REMOTE_OPEN
23:59:53.690Z SESSION_REMOTE_CLOSE
23:59:53.690Z SESSION_LOCAL_CLOSE
23:59:53.690Z SESSION_FINAL
23:59:53.706Z CONNECTION_REMOTE_CLOSE
23:59:53.706Z TRANSPORT_TAIL_CLOSED
23:59:53.706Z CONNECTION_LOCAL_CLOSE
23:59:53.706Z TRANSPORT_HEAD_CLOSED
23:59:53.706Z TRANSPORT_CLOSED
{code}
We also opened a ticket with Microsoft/Azure, as the disconnects seemed to originate from the Service Bus instance (SESSION_REMOTE_CLOSE). There we learned that Service Bus has a hardwired TTL of 5 minutes before inactive subscriptions are terminated; according to Azure support, the issue is most likely related to Qpid connection handling. This at least explains the first/initial timeout of 300000 ms seen in the logs. The connection/client is set up as follows.
URLs:
{code:java}
failover:(amqps://instance1.servicebus.windows.net)?amqp.idleTimeout=12000
failover:(amqps://instance2.servicebus.windows.net)?amqp.idleTimeout=12000
{code}
Connection Factories:
{code:java}
JmsConnectionFactory qpidConnectionFactory =
        new JmsConnectionFactory(sbUsername, sbPassword, sbAmqpRemoteUri);
qpidConnectionFactory.setReceiveLocalOnly(true);
qpidConnectionFactory.setClientID(UUID.randomUUID().toString());
{code}
JmsListenerContainerFactories:
{code:java}
DefaultJmsListenerContainerFactory jmsListenerContainerFactory =
        new DefaultJmsListenerContainerFactory();
jmsListenerContainerFactory.setConnectionFactory(jmsConnectionFactory);
jmsListenerContainerFactory.setPubSubDomain(true);
jmsListenerContainerFactory.setSubscriptionDurable(true);
jmsListenerContainerFactory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
return jmsListenerContainerFactory;
{code}
And Spring-JMS Listeners:
{code:java}
@Component
public class MyListener {

    @Autowired
    JMSSupport jmsSupport;

    @JmsListener(containerFactory = "myContainerInstance",
                 subscription = "mySubscriptionName",
                 destination = "myTopicName")
    public void onMessage(Message myMessage) throws JMSException {
        jmsSupport.consumeMessage(message -> {
            // business logic consuming the message goes here
        }, myMessage, TextMessage.class);
    }
}

@Component
public class JMSSupport {

    public <T extends Message> void consumeMessage(Consumer<T> messageConsumer,
            Message message, Class<T> acceptedMessageType) throws JMSException {
        try {
            // message validation goes here
            messageConsumer.accept((T) message);
        } catch (Exception ex) {
            // mark the delivery as MODIFIED_FAILED so the broker can redeliver
            message.setIntProperty(JmsMessageSupport.JMS_AMQP_ACK_TYPE,
                    JmsMessageSupport.MODIFIED_FAILED);
            throw ex;
        } finally {
            // acknowledge() applies the ack type set above (default: accepted)
            message.acknowledge();
        }
    }
}
{code}
In general, Qpid seems to handle refreshing the idle timeout correctly, sending frames at half of the interval configured in the URL (amqp.idleTimeout).
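As a sanity check on that behaviour, with the amqp.idleTimeout=12000 from our URLs the expected keep-alive cadence works out as follows (a minimal illustration of the arithmetic only, not qpid-jms API; the class and method names are ours):
{code:java}
public class IdleTimeoutMath {

    // By the AMQP 1.0 idle-timeout convention, a peer should emit (empty)
    // frames at no more than half the negotiated idle timeout.
    static long heartbeatIntervalMs(long idleTimeoutMs) {
        return idleTimeoutMs / 2;
    }

    public static void main(String[] args) {
        // amqp.idleTimeout=12000 from our connection URLs -> a frame every ~6 s
        System.out.println(heartbeatIntervalMs(12_000));  // 6000
        // the 300000 ms limit reported by Service Bus in the first log line
        System.out.println(heartbeatIntervalMs(300_000)); // 150000
    }
}
{code}
This matches the frame cadence we observe on healthy connections.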
We have not yet found any explanation why, after a random amount of time, one of three subscriptions gets into the state illustrated above while the other two subscriptions continue working. Please note that between the reconnect loops, the re-established connection/subscription will not receive any pending messages.
What is implemented differently from the most common scenarios?
* use of CLIENT_ACKNOWLEDGE mode
What attempts have been made to work around or reproduce the issue (none worked)?
* We tried running the client in a Docker container and introducing network packet loss (via tc), but could not provoke the issue yet.
* switching from DefaultJmsListenerContainerFactory to SimpleJmsListenerContainerFactory
* using a dedicated JmsListenerContainerFactory per subscription, even on the same server
* using failover:(amqps://) instead of amqps://
* We recently learned that it makes no difference whether the client connects to one or two Service Bus instances; the problem also shows up in 1 JMS client -> 1 Service Bus server -> 3 subscriptions scenarios, so following that path might be a red herring.
* The amount of time a subscription spends idle doesn't seem to have any effect on the issue; at least there seems to be no difference between an idle time of a few minutes and an idle time of days. We will monitor that more closely in the future.
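One additional mitigation we are considering (a hypothetical sketch only, not yet deployed; the class and method names below are ours, not qpid-jms or Spring API): a per-subscription watchdog that records when each listener last saw a message and reports subscriptions that have been silent longer than a threshold, so a stalled subscription could be torn down and re-established proactively instead of waiting out the reconnect loop.
{code:java}
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical watchdog: only detects the symptom (silence), it does not
// explain the SESSION_REMOTE_CLOSE events themselves.
public class SubscriptionWatchdog {

    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

    // Call from each @JmsListener on every delivered message.
    public void touch(String subscriptionName) {
        lastSeen.put(subscriptionName, Instant.now());
    }

    // Returns the subscriptions that have been silent longer than maxSilence.
    public List<String> stalled(Instant now, Duration maxSilence) {
        return lastSeen.entrySet().stream()
                .filter(e -> Duration.between(e.getValue(), now).compareTo(maxSilence) > 0)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
{code}
A scheduled task could call stalled(...) every minute and restart the affected listener container, at the cost of masking rather than fixing the underlying disconnects.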
> Stalled Service Bus Subscriptions, Reconnect Loops
> --------------------------------------------------
>
> Key: QPIDJMS-392
> URL: https://issues.apache.org/jira/browse/QPIDJMS-392
> Project: Qpid JMS
> Issue Type: Bug
> Components: qpid-jms-client
> Affects Versions: 0.32.0
> Environment: * Azure App Service
> * Spring 1.5.12 via embedded web server / standalone
> * java.runtime.version "1.8.0_144-b01"
> * java.vm.version "25.144-b01"
> * java.vm.vendor "Azul Systems, Inc."
> * os.arch "amd64"
> * os.name "Windows Server 2016"
> Reporter: Paul Rogalinski
> Priority: Major
> Attachments: qpid-log.txt
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]