[jira] [Commented] (PROTON-2411) Simultaneous idle timeout sequencing errors

Clifford Jansen (Jira) Tue, 03 Aug 2021 15:38:06 -0700


    [ 
https://issues.apache.org/jira/browse/PROTON-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392573#comment-17392573
 ]


Clifford Jansen commented on PROTON-2411:
-----------------------------------------

p2411_0.diff is a patch that can be applied to Proton 0.34 to help debug this 
issue.

With the patch, Proton no longer aborts when an AMQP connection is seen to 
set an earlier heartbeat timeout more than once. Instead it prints a detailed 
diagnostic and continues to run.

The problem is supposed to be very rare, and this change could introduce some 
new runaway problem. So if there are more than a handful of such sequencing 
errors on a single connection, the connection is terminated and the process 
continues to run, perhaps reconnecting as it would after any other temporary 
network failure (or continuing to listen, in the case of the router).
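The containment policy described above can be sketched roughly as follows. 
This is an illustration only, not the code in p2411_0.diff: the 
SequenceErrorGuard type, its record_error() method and the Action enum are 
hypothetical names, and the limit is a parameter because the patch's actual 
threshold is not stated in this message.

```cpp
#include <cassert>

// Hypothetical names for illustration; not taken from p2411_0.diff.
enum class Action { Continue, CloseConnection };

// Per-connection counter of timer sequencing errors.
struct SequenceErrorGuard {
    int errors = 0;  // sequencing errors seen so far on this connection
    int limit;       // assumed "handful"; the real patch value is not known here

    explicit SequenceErrorGuard(int max_errors) : limit(max_errors) {
        assert(max_errors > 0);
    }

    // Called where the unpatched code would have aborted.
    // A diagnostic would be logged here; once the limit is exceeded,
    // only this connection is closed and the process keeps running.
    Action record_error() {
        ++errors;
        return errors > limit ? Action::CloseConnection : Action::Continue;
    }
};
```

With a limit of 2, the first two calls to record_error() return 
Action::Continue and the third returns Action::CloseConnection.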

To collect the error information, Proton clients should be started with

    PN_LOG=ERROR+

in their process environment, or any other setting that includes ERROR level 
logging.

Similarly, the router configuration should allow "error+" logging levels.

The log messages will contain either

    "timer sequence error" or "timer multi sequence errors"

If you use the patch and find examples of these errors in the logs, please add 
a representative sample to the JIRA.
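To check collected logs for the two strings quoted above, a small helper along 
these lines may be enough. It is a sketch under assumptions: the layout of the 
log lines is not shown in this message, so it only does plain substring 
matching, and TimerErrorCounts and count_timer_errors() are hypothetical 
names.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Number of log lines containing each of the two diagnostics.
struct TimerErrorCounts {
    std::size_t single = 0;  // lines containing "timer sequence error"
    std::size_t multi = 0;   // lines containing "timer multi sequence errors"
};

// Scan log text line by line and count lines with either diagnostic.
inline TimerErrorCounts count_timer_errors(const std::string& log_text) {
    TimerErrorCounts counts;
    std::istringstream in(log_text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("timer multi sequence errors") != std::string::npos) {
            ++counts.multi;
        } else if (line.find("timer sequence error") != std::string::npos) {
            ++counts.single;
        }
    }
    return counts;
}
```

A nonzero count in either field means the log holds lines worth attaching to 
the issue.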

> Simultaneous idle timeout sequencing errors
> -------------------------------------------
>
>                 Key: PROTON-2411
>                 URL: https://issues.apache.org/jira/browse/PROTON-2411
>             Project: Qpid Proton
>          Issue Type: Bug
>          Components: proton-c
>    Affects Versions: proton-c-0.34.0
>            Reporter: Jaap Wiggelinkhuizen
>            Priority: Critical
>         Attachments: p2411_0.diff
>
>
> In our mission-critical software we use Qpid Proton 0.34.0 in our C++ client 
> software together with the Qpid Dispatch Router 1.16.0. We updated to these 
> versions not long ago; before that we used Proton 0.25.0 and Dispatch 1.3.0. 
> Our application runs on several VMs with a router on each VM. All clients 
> connect to the local router only, and the routers connect to each other in a 
> hub-and-spoke pattern. In both the client and the router configuration we 
> have configured an idle timeout of 30 seconds.
> On July 4th we were confronted with an incident in production where many of 
> our client processes reported problems regarding the idle timeouts. These 
> client processes had already been running stably for more than 3 weeks. The 
> problem appeared in two flavors:
>  # Transport error “error: amqp:resource-limit-exceeded: local-idle-timeout 
> expired”
>  # epoll proactor failure in epoll_timer.c:263: “idle timeout sequencing 
> error”
> On each VM at least 3 processes showed one of these problems in a total time 
> window of less than a minute. So far we haven’t found any cause in the 
> underlying hardware, hypervisor, network or operating system.
> Although we don’t know the root cause of the problems, we can solve the first 
> situation by using the proper reconnect settings (by mistake we handled 
> on_transport_error() as fatal; we will correct that so that only 
> on_transport_close() is handled as fatal). However, the second situation 
> is more odd because it results in an abort within Proton itself. The comments 
> in epoll_timer.c explain that this error occurs when a connection timer is 
> moved backwards a second time. We don’t understand how this can suddenly 
> happen.
>  
> Last Sunday the problem occurred again on two more production sites where our 
> software had been operational for just over 3 weeks. Again it happened on 
> all VMs within a short timeframe. Interestingly, so far it has only occurred 
> on Sunday mornings. Maybe it has something to do with how long the 
> software has been running and the fact that on Sunday mornings there is less 
> messaging traffic, i.e. more heartbeats?
>  
> Unfortunately we haven't been able to reproduce the issue at our test 
> facilities and hence cannot provide a reproducer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

