[
https://issues.apache.org/jira/browse/PROTON-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jaap Wiggelinkhuizen updated PROTON-2411:
-----------------------------------------
Affects Version/s: proton-c-0.34.0
Priority: Critical (was: Major)
> Simultaneous idle timeout sequencing errors
> -------------------------------------------
>
> Key: PROTON-2411
> URL: https://issues.apache.org/jira/browse/PROTON-2411
> Project: Qpid Proton
> Issue Type: Bug
> Affects Versions: proton-c-0.34.0
> Reporter: Jaap Wiggelinkhuizen
> Priority: Critical
>
> In our mission-critical software we use Qpid Proton 0.34.0 in our C++ client
> software together with the Qpid Dispatch Router 1.16.0. We updated to these
> versions not so long ago; before that we used Proton 0.25.0 and Dispatch
> 1.3.0. Our application runs on several VMs with a router on each VM. All
> clients connect to the local router only and the routers connect to each
> other in a hub-and-spoke pattern. In both the client configuration and the
> router configuration we have configured an idle timeout of 30 seconds.
> On July 4th we were confronted with an incident in production where a lot of
> our client processes reported problems regarding the idle timeouts. These
> client processes had already been running stably for more than 3 weeks. The
> problem appeared in two flavors:
> # Transport error “error: amqp:resource-limit-exceeded: local-idle-timeout
> expired”
> # epoll proactor failure in epoll_timer.c:263: “idle timeout sequencing
> error”
> On each VM at least 3 processes showed one of these problems in a total time
> window of less than a minute. So far we have not found any cause in the
> underlying hardware, hypervisor, network or operating system.
> Although we don’t know the root cause of the problems, we can solve the first
> situation by using the proper reconnect settings (by mistake we handled
> on_transport_error() as a fatal situation and will correct that so that only
> on_transport_close() will be handled as fatal). However, the second situation
> is stranger because it results in an abort within Proton itself. The comments
> in epoll_timer.c explain that this error occurs when a connection timer is
> moved backwards a second time. We don’t understand how this can happen
> suddenly.
>
> Last Sunday the problem occurred again on two more production sites where our
> software had been operational for just over 3 weeks. Again it happened on
> all VMs within a short timeframe. Interestingly, it has so far occurred only
> on Sunday mornings. Maybe it has something to do with how long the software
> has been running and the fact that on Sunday mornings there is less
> messaging traffic, i.e. more heartbeats?
>
> Unfortunately we haven't been able to reproduce the issue at our test
> facilities and hence cannot provide a reproducer.
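
The handling change described in the report (treat on_transport_error() as
recoverable, treat only on_transport_close() as fatal, and rely on reconnect
settings) can be illustrated with a small library-independent model. This is
only a sketch and does not use the Proton API: TransportEvent, Action,
ReconnectPolicy, classify() and retry_delay(), as well as the delay values,
are hypothetical names and numbers chosen for illustration.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical model types; none of these names come from Qpid Proton.
enum class TransportEvent { Error, Close };
enum class Action { Reconnect, Fatal };

// Policy described in the report: a transport error is recoverable,
// a transport close is the only fatal case.
inline Action classify(TransportEvent ev) {
    return ev == TransportEvent::Error ? Action::Reconnect : Action::Fatal;
}

// Assumed reconnect settings: exponential backoff with an upper bound.
struct ReconnectPolicy {
    std::chrono::milliseconds initial_delay{100};
    double multiplier{2.0};
    std::chrono::milliseconds max_delay{10000};
};

// Delay to wait before reconnect attempt number `attempt` (0-based).
inline std::chrono::milliseconds retry_delay(const ReconnectPolicy& p,
                                             int attempt) {
    const double cap = static_cast<double>(p.max_delay.count());
    double delay = static_cast<double>(p.initial_delay.count());
    for (int i = 0; i < attempt; ++i) {
        // Grow the delay on every failed attempt, but never past the cap.
        delay = std::min(delay * p.multiplier, cap);
    }
    return std::chrono::milliseconds(static_cast<long long>(delay));
}
```

With these assumed defaults the delays grow as 100 ms, 200 ms, 400 ms and so
on until they reach the 10 s cap, so a local-idle-timeout expiry leads to a
bounded series of reconnect attempts instead of terminating the process.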
--
This message was sent by Atlassian Jira
(v8.3.4#803005)