Re: local-idle-timeout and idle timeout sequencing errors on several instances

Cliff Jansen Fri, 16 Jul 2021 09:23:32 -0700

This is not a known bug. Despite your helpfully detailed account, I am
unable to see how a second “earlier” deadline could arise in the life of
an AMQP connection, even allowing for an off-by-one error.


Please raise a JIRA including any additional information you can think of.

Obviously a reproducer would be ideal, but may be hard to provide.

Are you building your own Proton libraries from source? If so I could try
to put together a patch that would be more resilient in the abort case and
gather some additional bread crumbs to help analyze the circumstances of
the failure.

Cliff



On Thu, Jul 15, 2021 at 3:31 AM Wiggelinkhuizen J (Jaap) <
jaap.wiggelinkhui...@intraffic.nl> wrote:

> Dear Qpid users,
>
>
>
> In our mission-critical software for the Dutch government we use Qpid
> Proton 0.34.0 in our C++ client software together with Qpid Dispatch
> Router 1.16.0. We updated to these versions fairly recently; before that we
> used Proton 0.25.0 and Dispatch 1.3.0. Our application runs on several VMs
> with a router on each VM. All clients connect to the local router only, and
> the routers connect to each other in a hub-and-spoke pattern. In both the
> client and the router configuration we have configured an idle timeout of
> 30 seconds.
>
>
>
> Two weeks ago we were confronted with an incident in production where a
> lot of our client processes reported problems regarding the idle timeouts.
> These client processes had already been running stably for more than 3 weeks.
> The problem appeared in two flavors:
>
>    1. Transport error “error: amqp:resource-limit-exceeded:
>    local-idle-timeout expired”
>    2. epoll proactor failure in epoll_timer.c:263: “idle timeout
>    sequencing error”
>
> On each VM at least 3 processes showed one of these problems in a time
> window of less than a minute. We haven’t found any cause in the underlying
> hardware, hypervisor, network or operating system until now.
>
>
>
> Although we don’t know the root cause of the problems, we can solve the
> first situation by using the proper reconnect settings. However, the second
> situation is stranger because it results in an abort within Proton itself.
> The comments in epoll_timer.c explain that this error occurs when a
> connection timer is moved backwards a second time. We don’t understand how
> this can suddenly happen.
>
>
>
> Has anyone experienced similar problems using recent Proton versions
> (the epoll_timer.c module was introduced in version 0.33.0)? And, even more
> importantly, is there a solution or workaround?
>
>
>
> Looking forward to any reaction. Thanks in advance!
>
>
>
> With kind regards,
>
>
>
> *Jaap Wiggelinkhuizen*
>
> Software architect & System integrator
>
>
>
>
>
>
>
> *E*    *jaap.wiggelinkhui...@intraffic.nl
> <jaap.wiggelinkhui...@intraffic.nl>*
>
> *W*   intraffic.nl <https://www.intraffic.nl/>
>
>
>
>
>
> *Visiting address: Iepenhoeve 11, 3438 MR Nieuwegein
> <https://www.google.com/maps/search/Iepenhoeve+11,+3438+MR+Nieuwegein?entry=gmail&source=g>*
>
>
>
> <https://ictgroup.eu/>
>
>
>
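
For readers less familiar with the first error flavor quoted above, here is a minimal, self-contained C++ sketch of how a local idle-timeout deadline behaves in principle. It is a toy model under stated assumptions, not Proton's implementation: the `IdleMonitor` type, its members, and `idle_error()` are hypothetical names, and time is passed in as plain milliseconds. The only details taken from the thread are the 30-second timeout and the error text.

```cpp
#include <cstdint>

// Hypothetical toy model of a local idle-timeout check (not Proton code).
// Times are plain milliseconds supplied by the caller.
struct IdleMonitor {
    std::int64_t timeout_ms;   // configured local idle timeout
    std::int64_t deadline_ms;  // moment at which the peer counts as idle

    IdleMonitor(std::int64_t timeout, std::int64_t now)
        : timeout_ms(timeout), deadline_ms(now + timeout) {}

    // Any frame received from the peer resets the deadline to now + timeout.
    // As long as `now` never decreases, the deadline only moves forward.
    void on_frame_received(std::int64_t now) { deadline_ms = now + timeout_ms; }

    // True once the deadline has been reached without further traffic.
    bool expired(std::int64_t now) const { return now >= deadline_ms; }
};

// Error text as quoted in the report above.
inline const char* idle_error() {
    return "amqp:resource-limit-exceeded: local-idle-timeout expired";
}
```

With a 30000 ms timeout started at time 0, the monitor reports expiry at 30000 ms unless a frame arrives first. A frame at 20000 ms moves the deadline to 50000 ms. In this simplified model the deadline never moves backwards as long as the clock is monotonic, which is why a deadline that moves to an earlier time is surprising.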
