[
https://issues.apache.org/jira/browse/PROTON-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467645#comment-16467645
]
Alan Conway edited comment on PROTON-1842 at 5/8/18 4:52 PM:
-------------------------------------------------------------
The threaderciser is showing races in connection close; I'm not sure if they
are the same issue we are looking at here. Attached are race.vg and race.tsan,
the output from helgrind and the thread sanitizer. Valgrind detects a *lot*
more races, probably because it slows things down so much, but the tsan
stack traces are consistent with valgrind.
This looks consistent with your theory, in particular a mutex being destroyed
concurrently with being unlocked during shutdown. One thread locks, sees
everything is ready to finalize and destroys the connection state while the
second thread is blocked on the mutex. The second thread gets released when
the first thread unlocks before pthread_mutex_destroy, but explodes when it
tries to unlock after the destroy.
To run:
{code:java}
cmake -DTHREADERCISER=ON .. && make && valgrind --tool=helgrind c/tests/c-threaderciser -time 60
cmake -DENABLE_TSAN=ON -DTHREADERCISER=ON .. && make && c/tests/c-threaderciser -time 60
{code}
> [c] Dispatch/Proton crashes when opening/closing connections
> ------------------------------------------------------------
>
> Key: PROTON-1842
> URL: https://issues.apache.org/jira/browse/PROTON-1842
> Project: Qpid Proton
> Issue Type: Bug
> Components: proton-c
> Affects Versions: proton-c-0.22.0
> Reporter: Chuck Rolke
> Priority: Major
> Attachments: helloworld.cpp, race.tsan, race.vg
>
>
> Using proton cpp example code that is modified to open and close connections
> by the thousands in the main loop and having the event loop short circuit any
> messaging with:
> {{ void on_connection_open(proton::connection& c) {}}
> {{ c.close();}}
> {{ }}}
> and then directing this client example to a dispatch router 1.1.0. Eventually
> (after 100,000 to 1,000,000 connection open/closes) the router crashes with:
> {{qdrouterd: /home/chug/git/qpid-proton/c/src/proactor/epoll.c:466:
> wake_pop_front: Assertion `p->wakes_in_progress' failed.}}
> and with:
> {{qdrouterd: /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2014:
> proactor_do_epoll: Assertion `ee->type == PCONNECTION_TIMER' failed.}}
> This issue seems to happen only with qpid-dispatch accepting the open/close
> event stream. Proton cpp example _server_direct_ and c example _direct_ work
> properly with the same open/close event stream mounting into the 10s of
> millions of connections.
> A core dump backtrace with the PCONNECTION_TIMER failure reads as:
> {{(gdb) bt}}
> {{#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51}}
> {{#1 0x00007f795c712c41 in __GI_abort () at abort.c:79}}
> {{#2 0x00007f795c709f7a in __assert_fail_base (fmt=0x7f795c85a260
> "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
> assertion=assertion@entry=0x7f795d72e15a "ee->type == PCONNECTION_TIMER", }}
> {{ file=file@entry=0x7f795d72de98
> "/home/chug/git/qpid-proton/c/src/proactor/epoll.c", line=line@entry=2014, }}
> {{ function=function@entry=0x7f795d72e320 <__PRETTY_FUNCTION__.6307>
> "proactor_do_epoll") at assert.c:92}}
> {{#3 0x00007f795c709ff2 in __GI___assert_fail (assertion=0x7f795d72e15a
> "ee->type == PCONNECTION_TIMER", file=0x7f795d72de98
> "/home/chug/git/qpid-proton/c/src/proactor/epoll.c", line=2014, }}
> {{ function=0x7f795d72e320 <__PRETTY_FUNCTION__.6307> "proactor_do_epoll")
> at assert.c:101}}
> {{#4 0x00007f795d72d29f in proactor_do_epoll (p=0x26b7310, can_block=true)
> at /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2014}}
> {{#5 0x00007f795d72d30e in pn_proactor_wait (p=0x26b7310) at
> /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2030}}
> {{#6 0x00007f795dbe89ad in thread_run (arg=0x26be750) at
> /home/chug/git/qpid-dispatch/src/server.c:946}}
> {{#7 0x00007f795d50e50b in start_thread (arg=0x7f794f486700) at
> pthread_create.c:465}}
> {{#8 0x00007f795c7d216f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95}}
>