[ 
https://issues.apache.org/jira/browse/PROTON-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383088#comment-17383088
 ] 

Jaap Wiggelinkhuizen commented on PROTON-2411:
----------------------------------------------

Interestingly, the core dumps of the incidents show that the abort occurred at 
the EPOLL_FATAL in replace_timer_deadline(), although stderr shows the 
error message of the EPOLL_FATAL in pni_timer_set():
{code:none}
Program terminated with signal SIGABRT, Aborted.
#0  0xf77d0430 in __kernel_vsyscall ()
[Current thread is 1 (Thread 0xf17ffb40 (LWP 57506))]
Missing separate debuginfos, use: debuginfo-install expat-2.1.0-10.el7_3.i686 
fontconfig-2.10.95-11.el7.i686 freetype-2.4.11-15.el7.i686 
glibc-2.17-222.el7.i686 keyutils-libs-1.5.8-3.el7.i686 
krb5-libs-1.15.1-18.el7.i686 libICE-1.0.9-9.el7.i686 libSM-1.2.2-2.el7.i686 
libX11-1.6.5-1.el7.i686 libXau-1.0.8-2.1.el7.i686 libXext-1.3.3-3.el7.i686 
libXft-2.3.2-2.el7.i686 libXmu-1.1.2-2.el7.i686 libXp-1.0.2-2.1.el7.i686 
libXrender-0.9.10-1.el7.i686 libXt-1.1.5-3.el7.i686 
libcom_err-1.42.9-11.el7.i686 libgcc-4.8.5-28.el7.i686 
libjpeg-turbo-1.2.90-5.el7.i686 libpng-1.5.13-7.el7_2.i686 
libselinux-2.5-12.el7.i686 libstdc++-4.8.5-28.el7.i686 libunwind-1.2-2.el7.i686 
libuuid-2.23.2-52.el7.i686 libxcb-1.12-1.el7.i686 motif-2.3.4-12.el7_4.i686 
ncurses-libs-5.9-14.20130511.el7_4.i686 openssl-libs-1.0.2k-12.el7.i686 
pcre-8.32-17.el7.i686 protobuf-2.5.0-8.el7.i686 xerces-c-3.1.1-8.el7_2.i686
(gdb) where
#0  0xf77d0430 in __kernel_vsyscall ()
#1  0xf6af8147 in raise () from /lib/libc.so.6
#2  0xf6af9a52 in abort () from /lib/libc.so.6
#3  0xf6964bac in replace_timer_deadline (timer=<optimized out>, tm=<optimized out>)
    at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll_timer.c:375
#4  pni_timer_set (timer=<optimized out>, deadline=<optimized out>)
    at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll_timer.c:267
#5  0xf6964f7d in pconnection_tick (pc=pc@entry=0xf3a06a00)
    at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll.c:1433
#6  0xf696692e in pconnection_process (pc=pc@entry=0xf3a06a00, events=<optimized out>, events@entry=1, sched_ready=sched_ready@entry=false, topup=false)
    at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll.c:1211
#7  0xf6968237 in process (tsk=0xf3a06a00)
    at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll.c:2213
#8  next_event_batch (p=0xa4ec218, can_block=true)
    at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/c/src/proactor/epoll.c:2423
#9  0xf6e14b9d in proton::container::impl::thread() (this=0xa4ec168)
    at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/cpp/src/proactor_container_impl.cpp:760
#10 0xf6e1536d in proton::container::impl::run(int) (this=0xa4ec168, threads=1)
    at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/cpp/src/proactor_container_impl.cpp:812
#11 0xf6e020be in proton::container::run() (this=0xa4ec034)
    at /opt/jenkins_home/workspace/PRL/Qpid_release_build/qpid-proton-src/cpp/src/container.cpp:92
#12 0x0868eeb2 in pbamq::Communicator::ProtonContainerThread (this=0xa4ec008)
    at /usr/src/prl/src_0/vptlib/postbus/amq/communicator.cxx:657
#13 0xf6e28f5d in execute_native_thread_routine () from /opt/qpid-itr/lib/libqpid-proton-cpp.so.12
#14 0xf77aeb1c in start_thread () from /lib/libpthread.so.0
#15 0xf6bc745e in clone () from /lib/libc.so.6
{code}

> Simultaneous idle timeout sequencing errors
> -------------------------------------------
>
>                 Key: PROTON-2411
>                 URL: https://issues.apache.org/jira/browse/PROTON-2411
>             Project: Qpid Proton
>          Issue Type: Bug
>            Reporter: Jaap Wiggelinkhuizen
>            Priority: Major
>
> In our mission-critical software we use Qpid Proton 0.34.0 in our C++ client 
> software together with Qpid Dispatch Router 1.16.0. We updated to these 
> versions not long ago; before that we used Proton 0.25.0 and Dispatch 1.3.0. 
> Our application runs on several VMs with a router on each VM. All clients 
> connect to the local router only, and the routers connect to each other in a 
> hub-and-spoke pattern. In both the client and the router configuration we 
> have configured an idle timeout of 30 seconds.
> On July 4th we were confronted with an incident in production where many of 
> our client processes reported problems regarding the idle timeouts. These 
> client processes had already been running stably for more than 3 weeks. The 
> problem appeared in two flavors:
>  # Transport error “error: amqp:resource-limit-exceeded: local-idle-timeout 
> expired”
>  # epoll proactor failure in epoll_timer.c:263: “idle timeout sequencing 
> error”
> On each VM at least 3 processes showed one of these problems within a total 
> time window of less than a minute. So far we have not found any cause in the 
> underlying hardware, hypervisor, network or operating system.
> Although we don’t know the root cause of the problems, we can solve the first 
> situation by using the proper reconnect settings (by mistake we handled 
> on_transport_error() as a fatal situation; we will correct that so that only 
> on_transport_close() is handled as fatal). However, the second situation is 
> stranger because it results in an abort within Proton itself. The comments 
> in epoll_timer.c explain that this error occurs when a connection timer is 
> moved backwards a second time. We don’t understand how this can suddenly 
> happen.
>  
> Last Sunday the problem occurred again on two more production sites where our 
> software had been operational for just over 3 weeks. Again it happened on 
> all VMs within a short time frame. Interestingly, so far it has only occurred 
> on Sunday mornings. Maybe it has something to do with how long the software 
> has been running and the fact that on Sunday mornings there is less 
> messaging traffic, i.e. more heartbeats.
>  
> Unfortunately we haven't been able to reproduce the issue at our test 
> facilities and hence cannot provide a reproducer.
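
The rule cited above from the comments in epoll_timer.c (a connection timer
may be moved backwards once, and a second backward move is treated as fatal)
can be illustrated with a small toy model. This is only a sketch of the rule
as the issue description states it, not the Proton implementation: the class
ToyConnectionTimer, the enum SetResult, and the choice never to reset the
"moved back" flag are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only. This is NOT the code in epoll_timer.c; all names here
// are hypothetical and exist purely to illustrate the rule.
enum class SetResult { Accepted, AcceptedBackwardOnce, SequencingError };

class ToyConnectionTimer {
public:
    // Try to move the deadline. A backward move (earlier than the current
    // deadline) is tolerated once; a second one is reported as an error
    // and leaves the stored deadline unchanged.
    SetResult set_deadline(std::uint64_t deadline_ms) {
        const bool backward = has_deadline_ && deadline_ms < deadline_ms_;
        if (backward && moved_back_) {
            // Second backward move: the case the issue reports as
            // "idle timeout sequencing error".
            return SetResult::SequencingError;
        }
        deadline_ms_ = deadline_ms;
        has_deadline_ = true;
        if (backward) {
            // Assumption: the flag is never cleared in this toy model.
            moved_back_ = true;
            return SetResult::AcceptedBackwardOnce;
        }
        return SetResult::Accepted;
    }

    std::uint64_t deadline_ms() const { return deadline_ms_; }

private:
    std::uint64_t deadline_ms_ = 0;
    bool has_deadline_ = false;
    bool moved_back_ = false;
};
```

Feeding this model the deadlines 30000, 45000, 40000 and 35000 (in ms)
accepts the first three and reports a sequencing error on the fourth, since
40000 and 35000 are both backward moves.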


