Jaap Wiggelinkhuizen created PROTON-2411:
--------------------------------------------

             Summary: Simultaneous idle timeout sequencing errors
                 Key: PROTON-2411
                 URL: https://issues.apache.org/jira/browse/PROTON-2411
             Project: Qpid Proton
          Issue Type: Bug
            Reporter: Jaap Wiggelinkhuizen


In our mission-critical software we use Qpid Proton 0.34.0 in our C++ client
software together with Qpid Dispatch Router 1.16.0. We updated to these
versions fairly recently; before that we used Proton 0.25.0 and Dispatch 1.3.0.
Our application runs on several VMs with a router on each VM. All clients
connect to the local router only, and the routers connect to each other in a
hub-and-spoke pattern. In both the client configuration and the router
configuration we have configured an idle timeout of 30 seconds.

On July 4th we had an incident in production in which many of our client
processes reported problems related to idle timeouts. These client processes
had been running stably for more than 3 weeks. The problem appeared in two
flavors:
 1. Transport error “error: amqp:resource-limit-exceeded: local-idle-timeout
    expired”
 2. epoll proactor failure in epoll_timer.c:263: “idle timeout sequencing error”

On each VM at least 3 processes showed one of these problems within a total
time window of less than a minute. So far we haven’t found any cause in the
underlying hardware, hypervisor, network or operating system.

Although we don’t know the root cause of the problems, we can handle the first
situation by using proper reconnect settings. (By mistake we treated
on_transport_error() as fatal; we will correct that so that only
on_transport_close() is treated as fatal.) The second situation is stranger,
however, because it results in an abort within Proton itself. The comments in
epoll_timer.c explain that this error occurs when a connection timer is moved
backwards a second time. We don’t understand how this can suddenly happen.
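
The corrected handling of the first situation can be sketched as follows. This
is a standalone model, not Proton code: ClientModel and its fields are
hypothetical names introduced for illustration, and only the two callback names
are taken from the description above. It assumes that reconnect settings, when
configured, take care of re-establishing the connection after a transport
error.

```cpp
#include <string>

// Hypothetical stand-in for a client handler. It mirrors the two callback
// names from the report but does not use or depend on Proton.
struct ClientModel {
    bool fatal = false;        // set once the client should give up
    int transport_errors = 0;  // errors seen and left to the reconnect logic
    std::string last_error;    // most recent error text, kept for logging

    // A transport error (for example an expired local idle timeout) is
    // recorded but not treated as fatal; the reconnect settings are assumed
    // to re-establish the connection.
    void on_transport_error(const std::string& description) {
        ++transport_errors;
        last_error = description;
    }

    // Only a transport close is treated as fatal.
    void on_transport_close() {
        fatal = true;
    }
};
```

In this model, calling on_transport_error() with the idle-timeout text leaves
fatal unset, and only a later on_transport_close() sets it.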

 

Last Sunday the problem occurred again at two more production sites where our
software had been operational for just over 3 weeks. Again it happened on all
VMs within a short time frame. Interestingly, so far it has only occurred on
Sunday mornings. Maybe it has something to do with how long the software has
been running and the fact that on Sunday mornings there is less messaging
traffic, i.e. more heartbeats?

 

Unfortunately we haven't been able to reproduce the issue in our test
facilities and hence cannot provide a reproducer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
