[ 
https://issues.apache.org/jira/browse/PROTON-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaap Wiggelinkhuizen updated PROTON-2411:
-----------------------------------------
    Affects Version/s: proton-c-0.34.0
             Priority: Critical  (was: Major)

> Simultaneous idle timeout sequencing errors
> -------------------------------------------
>
>                 Key: PROTON-2411
>                 URL: https://issues.apache.org/jira/browse/PROTON-2411
>             Project: Qpid Proton
>          Issue Type: Bug
>    Affects Versions: proton-c-0.34.0
>            Reporter: Jaap Wiggelinkhuizen
>            Priority: Critical
>
> In our mission-critical software we use Qpid Proton 0.34.0 in our C++ client
> software together with Qpid Dispatch Router 1.16.0. We upgraded to these
> versions recently; before that we used Proton 0.25.0 and Dispatch 1.3.0.
> Our application runs on several VMs with a router on each VM. All clients
> connect to the local router only, and the routers connect to each other in a
> hub-and-spoke pattern. In both the client configuration and the router
> configuration we have configured an idle timeout of 30 seconds.
> On July 4th we had an incident in production in which many of our client
> processes reported problems related to the idle timeouts. These client
> processes had been running stably for more than 3 weeks. The problem
> appeared in two flavors:
>  # Transport error “error: amqp:resource-limit-exceeded: local-idle-timeout 
> expired”
>  # epoll proactor failure in epoll_timer.c:263: “idle timeout sequencing 
> error”
> On each VM at least 3 processes showed one of these problems within a total
> time window of less than a minute. So far we have not found any cause in the
> underlying hardware, hypervisor, network or operating system.
> Although we don’t know the root cause of the problems, we can resolve the
> first situation by using proper reconnect settings (by mistake we treated
> on_transport_error() as fatal; we will correct that so that only
> on_transport_close() is treated as fatal). The second situation, however,
> is stranger because it results in an abort within Proton itself. The comments
> in epoll_timer.c explain that this error occurs when a connection timer is
> moved backwards a second time. We don’t understand how this can suddenly
> happen.
>  
> Last Sunday the problem occurred again at two more production sites where our
> software had been operational for just over 3 weeks. Again it happened on
> all VMs within a short time frame. Interestingly, so far it has only occurred
> on Sunday mornings. Perhaps it is related to how long the software has been
> running and to the fact that on Sunday mornings there is less messaging
> traffic and therefore more heartbeats?
>  
> Unfortunately we haven't been able to reproduce the issue in our test
> facilities and hence cannot provide a reproducer.


