[
https://issues.apache.org/jira/browse/PROTON-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16468176#comment-16468176
]
Alan Conway edited comment on PROTON-1842 at 5/9/18 1:07 AM:
-------------------------------------------------------------
Another note: the latest threaderciser shows the race with flags "-listen
-connect -close-listen", so the only things racing here are IO events
from connection errors and proactor-generated wakes. There are no user wakes
involved.
I am seeing a race between pn_proactor_done() (user thread) deciding to finalize
a connection and an epoll thread waking up to process it. The epoll thread is
racing to lock the context mutex while the user thread is deleting it. I'm not
seeing a crash, but it clearly could crash with the right timing.
Speculating: we need to bring back something like the ee->mutex to sync around
epoll mods and waits. The variables in
pconnection_is_final(pconnection_t *pc) {
  return !pc->current_arm && !pc->timer_armed && !pc->context.wake_ops;
}
need to be synchronized around epoll events, because right now it seems that
is_final can return true concurrently with epoll_wait returning the same pc, so
current_arm does not appear to be properly synced.
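To make the speculation concrete, here is a minimal sketch of the invariant I
have in mind. It is not the epoll.c code and not a proposed patch:
demo_pconn_t, demo_arm(), demo_handle_event() and demo_try_finalize() are
hypothetical names, and the single per-connection mutex is an assumption. The
idea is that current_arm is set under the lock before the fd is armed and
cleared only by the thread that received the event, so a finalize check made
under the same lock cannot return true while an event is still deliverable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for pconnection_t: only the fields read by
 * pconnection_is_final(), plus one mutex assumed to guard all of them. */
typedef struct demo_pconn {
  pthread_mutex_t mutex;
  bool current_arm;  /* set before arming in epoll, cleared by the handler */
  bool timer_armed;
  int wake_ops;
} demo_pconn_t;

static void demo_pconn_init(demo_pconn_t *pc) {
  pthread_mutex_init(&pc->mutex, NULL);
  pc->current_arm = false;
  pc->timer_armed = false;
  pc->wake_ops = 0;
}

/* Mark the connection armed BEFORE the (omitted) epoll_ctl() call, so no
 * event can be delivered while current_arm is still false. */
static void demo_arm(demo_pconn_t *pc) {
  pthread_mutex_lock(&pc->mutex);
  pc->current_arm = true;
  /* the real epoll_ctl(..., EPOLL_CTL_MOD, ...) call would go here */
  pthread_mutex_unlock(&pc->mutex);
}

/* epoll thread: called after epoll_wait() returns this connection.
 * Only this path clears current_arm, and only while holding the lock. */
static void demo_handle_event(demo_pconn_t *pc) {
  pthread_mutex_lock(&pc->mutex);
  assert(pc->current_arm);  /* an event implies the fd was armed */
  /* IO processing would go here */
  pc->current_arm = false;
  pthread_mutex_unlock(&pc->mutex);
}

/* User thread: same test as pconnection_is_final(), but made under the
 * lock that also guards every write to the three fields. */
static bool demo_try_finalize(demo_pconn_t *pc) {
  pthread_mutex_lock(&pc->mutex);
  bool final = !pc->current_arm && !pc->timer_armed && !pc->wake_ops;
  pthread_mutex_unlock(&pc->mutex);
  return final;
}
```

With this shape, demo_try_finalize() returns false for the whole interval
between demo_arm() and demo_handle_event(). The sketch deliberately leaves out
timers, wakes and the question of which thread actually frees the connection.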
> [c] Dispatch/Proton crashes when opening/closing connections
> ------------------------------------------------------------
>
> Key: PROTON-1842
> URL: https://issues.apache.org/jira/browse/PROTON-1842
> Project: Qpid Proton
> Issue Type: Bug
> Components: proton-c
> Affects Versions: proton-c-0.22.0
> Reporter: Chuck Rolke
> Priority: Major
> Attachments: helloworld.cpp, race.tsan, race.vg
>
>
> Using proton cpp example code that is modified to open and close connections
> by the thousands in the main loop and having the event loop short circuit any
> messaging with:
> void on_connection_open(proton::connection& c) {
>   c.close();
> }
> and then directing this client example to a dispatch router 1.1.0. Eventually
> (after 100,000 to 1,000,000 connection open/closes) the router crashes with:
> {{qdrouterd: /home/chug/git/qpid-proton/c/src/proactor/epoll.c:466:
> wake_pop_front: Assertion `p->wakes_in_progress' failed.}}
> and with:
> {{qdrouterd: /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2014:
> proactor_do_epoll: Assertion `ee->type == PCONNECTION_TIMER' failed.}}
> This issue seems to happen only with qpid-dispatch accepting the open/close
> event stream. Proton cpp example _server_direct_ and c example _direct_ work
> properly with the same open/close event stream mounting into the 10s of
> millions of connections.
> A core dump backtrace with the PCONNECTION_TIMER failure reads as:
> {{(gdb) bt}}
> {{#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51}}
> {{#1 0x00007f795c712c41 in __GI_abort () at abort.c:79}}
> {{#2 0x00007f795c709f7a in __assert_fail_base (fmt=0x7f795c85a260
> "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
> assertion=assertion@entry=0x7f795d72e15a "ee->type == PCONNECTION_TIMER", }}
> {{ file=file@entry=0x7f795d72de98
> "/home/chug/git/qpid-proton/c/src/proactor/epoll.c", line=line@entry=2014, }}
> {{ function=function@entry=0x7f795d72e320 <__PRETTY_FUNCTION__.6307>
> "proactor_do_epoll") at assert.c:92}}
> {{#3 0x00007f795c709ff2 in __GI___assert_fail (assertion=0x7f795d72e15a
> "ee->type == PCONNECTION_TIMER", file=0x7f795d72de98
> "/home/chug/git/qpid-proton/c/src/proactor/epoll.c", line=2014, }}
> {{ function=0x7f795d72e320 <__PRETTY_FUNCTION__.6307> "proactor_do_epoll")
> at assert.c:101}}
> {{#4 0x00007f795d72d29f in proactor_do_epoll (p=0x26b7310, can_block=true)
> at /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2014}}
> {{#5 0x00007f795d72d30e in pn_proactor_wait (p=0x26b7310) at
> /home/chug/git/qpid-proton/c/src/proactor/epoll.c:2030}}
> {{#6 0x00007f795dbe89ad in thread_run (arg=0x26be750) at
> /home/chug/git/qpid-dispatch/src/server.c:946}}
> {{#7 0x00007f795d50e50b in start_thread (arg=0x7f794f486700) at
> pthread_create.c:465}}
> {{#8 0x00007f795c7d216f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95}}
>