[
https://issues.apache.org/jira/browse/PROTON-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383088#comment-17383088
]
Jaap Wiggelinkhuizen commented on PROTON-2411:
----------------------------------------------
Interestingly, the core dumps of the incidents show that the abort occurred at
the EPOLL_FATAL in replace_timer_deadline(), although stderr shows the
error message of the EPOLL_FATAL in pni_timer_set():
{code:java}
Program terminated with signal SIGABRT, Aborted.
#0 0xf77d0430 in __kernel_vsyscall ()
[Current thread is 1 (Thread 0xf17ffb40 (LWP 57506))]
Missing separate debuginfos, use: debuginfo-install expat-2.1.0-10.el7_3.i686
fontconfig-2.10.95-11.el7.i686 freetype-2.4.11-15.el7.i686
glibc-2.17-222.el7.i686 keyutils-libs-1.5.8-3.el7.i686
krb5-libs-1.15.1-18.el7.i686 libICE-1.0.9-9.el7.i686 libSM-1.2.2-2.el7.i686
libX11-1.6.5-1.el7.i686 libXau-1.0.8-2.1.el7.i686 libXext-1.3.3-3.el7.i686
libXft-2.3.2-2.el7.i686 libXmu-1.1.2-2.el7.i686 libXp-1.0.2-2.1.el7.i686
libXrender-0.9.10-1.el7.i686 libXt-1.1.5-3.el7.i686
libcom_err-1.42.9-11.el7.i686 libgcc-4.8.5-28.el7.i686
libjpeg-turbo-1.2.90-5.el7.i686 libpng-1.5.13-7.el7_2.i686
libselinux-2.5-12.el7.i686 libstdc++-4.8.5-28.el7.i686 libunwind-1.2-2.el7.i686
libuuid-2.23.2-52.el7.i686 libxcb-1.12-1.el7.i686 motif-2.3.4-12.el7_4.i686
ncurses-libs-5.9-14.20130511.el7_4.i686 openssl-libs-1.0.2k-12.el7.i686
pcre-8.32-17.el7.i686 protobuf-2.5.0-8.el7.i686 xerces-c-3.1.1-8.el7_2.i686
(gdb) where
#0 0xf77d0430 in __kernel_vsyscall ()
#1 0xf6af8147 in raise () from /lib/libc.so.6
#2 0xf6af9a52 in abort () from /lib/libc.so.6
#3 0xf6964bac in replace_timer_deadline (timer=<optimized out>, tm=<optimized out>) at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll_timer.c:375
#4 pni_timer_set (timer=<optimized out>, deadline=<optimized out>) at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll_timer.c:267
#5 0xf6964f7d in pconnection_tick (pc=pc@entry=0xf3a06a00) at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll.c:1433
#6 0xf696692e in pconnection_process (pc=pc@entry=0xf3a06a00, events=<optimized out>, events@entry=1, sched_ready=sched_ready@entry=false, topup=false) at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll.c:1211
#7 0xf6968237 in process (tsk=0xf3a06a00) at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll.c:2213
#8 next_event_batch (p=0xa4ec218, can_block=true) at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll.c:2423
#9 0xf6e14b9d in proton::container::impl::thread() (this=0xa4ec168) at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/cpp/src/proactor_container_impl.cpp:760
#10 0xf6e1536d in proton::container::impl::run(int) (this=0xa4ec168, threads=1) at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/cpp/src/proactor_container_impl.cpp:812
#11 0xf6e020be in proton::container::run() (this=0xa4ec034) at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/cpp/src/container.cpp:92
#12 0x0868eeb2 in pbamq::Communicator::ProtonContainerThread (this=0xa4ec008) at /usr/src/prl/src_0/vptlib/postbus/amq/communicator.cxx:657
#13 0xf6e28f5d in execute_native_thread_routine () from /opt/qpid-itr/lib/libqpid-proton-cpp.so.12
#14 0xf77aeb1c in start_thread () from /lib/libpthread.so.0
#15 0xf6bc745e in clone () from /lib/libc.so.6
{code}
> Simultaneous idle timeout sequencing errors
> -------------------------------------------
>
> Key: PROTON-2411
> URL: https://issues.apache.org/jira/browse/PROTON-2411
> Project: Qpid Proton
> Issue Type: Bug
> Reporter: Jaap Wiggelinkhuizen
> Priority: Major
>
> In our mission-critical software we use Qpid Proton 0.34.0 in our C++ client
> software together with the Qpid Dispatch Router 1.16.0. We updated to these
> versions not so long ago; before that we used Proton 0.25.0 and Dispatch 1.3.0.
> Our application runs on several VMs with a router on each VM. All clients
> connect to the local router only, and the routers connect to each other in a
> hub-and-spoke pattern. In both the client and the router configuration we
> have configured an idle timeout of 30 seconds.
> On July 4th we were confronted with an incident in production where many of
> our client processes reported problems regarding the idle timeouts. These
> client processes had already been running stably for more than 3 weeks. The
> problem appeared in two flavors:
> # Transport error “error: amqp:resource-limit-exceeded: local-idle-timeout
> expired”
> # epoll proactor failure in epoll_timer.c:263: “idle timeout sequencing
> error”
> On each VM at least 3 processes showed one of these problems within a total
> time window of less than a minute. So far we have not found any cause in the
> underlying hardware, hypervisor, network or operating system.
> Although we don’t know the root cause of the problems, we can solve the first
> situation by using the proper reconnect settings. By mistake we handled
> on_transport_error() as a fatal situation; we will correct that so that only
> on_transport_close() is handled as fatal. The second situation, however, is
> stranger because it results in an abort within Proton itself. The comments
> in epoll_timer.c explain that this error occurs when a connection timer is
> moved backwards a second time. We don’t understand how this can suddenly
> happen.
>
> Last Sunday the problem occurred again on two more production sites where our
> software had been operational for just over 3 weeks. Again it happened on
> all VMs within a short timeframe. Interestingly, so far it has only occurred
> on Sunday mornings. Maybe it has something to do with how long the
> software has been running and the fact that on Sunday mornings there is less
> messaging traffic, and therefore more heartbeats.
>
> Unfortunately we haven't been able to reproduce the issue at our test
> facilities and hence cannot provide a reproducer.
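
The corrected handling described in the issue text (a transport error is left to the reconnect settings, and only a transport close is fatal) can be sketched as a small self-contained model. This is an illustration only and does not use the Proton API: `TransportEvent`, `ClientState`, `handle()` and `replay()` are hypothetical names standing in for the application's on_transport_error() and on_transport_close() callbacks.

```cpp
#include <vector>

// Hypothetical stand-ins for the two handler callbacks named in the issue;
// this models the policy only and is not the Proton API.
enum class TransportEvent { Error, Close };

struct ClientState {
    int recoverable_errors = 0;  // errors left to the reconnect settings
    bool fatal = false;          // set once a transport close is seen
};

// Apply the corrected policy to a single event.
void handle(ClientState& state, TransportEvent ev) {
    if (state.fatal) return;  // ignore anything after the fatal close
    switch (ev) {
    case TransportEvent::Error:
        ++state.recoverable_errors;  // record it, but keep running
        break;
    case TransportEvent::Close:
        state.fatal = true;  // treated as fatal, as the issue describes
        break;
    }
}

// Replay a sequence of events and return the resulting state.
ClientState replay(const std::vector<TransportEvent>& events) {
    ClientState state;
    for (TransportEvent ev : events) handle(state, ev);
    return state;
}
```

Replaying two `Error` events leaves the modeled client running with two recorded errors, while a `Close` event marks it fatal and later events are ignored.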