[
https://issues.apache.org/jira/browse/DISPATCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976254#comment-16976254
]
Charles E. Rolke commented on DISPATCH-1475:
--------------------------------------------
After enough stack backtraces and log files a pattern to explain this segfault
is emerging.
* The router network has four interior routers A-B-C-D in a linear
arrangement. Each interior router has two edge routers attached.
* The test has a long-lived stream of multicast messages between a sender on
an edge router going in to interior A to a receiver on an edge router coming
out of interior D.
* Then a series of clients connect to one of the edge routers (EC1) going to
interior C (INTC), receive a small number of messages, and then close their
connections.
* Eventually EC1 segfaults.
* The link that is failing is on the interrouter connection between EC1 and
INTC; it is not a client link trying to receive a few messages from the data
stream at EC1.
Logically, when one of the clients connects, EC1 opens a link to INTC to
receive messages for the address. If more clients connect before the first
disconnects, they all share the messages coming over the link to INTC. When,
by chance, all the clients are disconnected at the same time, EC1 determines
that the link to INTC is no longer necessary and closes it. This close is
generated by edge address tracking and not by AMQP protocol events.
The problem is that this link closure is not properly sequenced with the
relentless stream of messages arriving over the interrouter link. Sometimes the
link is closed internally and deleted while activity on that link is still in
flight on an IO thread. The IO thread then uses a link that has been returned
to the free pool (memory filled with 0x99 bytes) and segfaults.
IO threads calling qdr_connection_process need a scheme like safe_pointer to
detect that the underlying link has been deleted and to react accordingly.
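A minimal sketch of what such a safe-pointer scheme could look like, to
illustrate the idea only. All names here (link_t, safe_link_ptr_t,
link_alloc, link_free, safe_ptr_set, safe_ptr_deref) are hypothetical and are
not the router's actual types or API. The sketch assumes pooled objects carry
a sequence number that is bumped on every free and is never overwritten or
returned to the OS, so a holder can compare the sequence it captured with the
current one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pooled link object; stands in for the router's link type. */
typedef struct {
    uint32_t seq;     /* generation counter, bumped on every free */
    int      in_use;  /* 1 while allocated, 0 while in the free pool */
    int      credit;  /* placeholder payload */
} link_t;

/* Hypothetical safe pointer: raw pointer plus the generation seen at capture. */
typedef struct {
    link_t  *ptr;
    uint32_t seq;
} safe_link_ptr_t;

/* One-slot pool so that reuse of a freed slot is easy to demonstrate. */
static link_t pool[1];

static link_t *link_alloc(void)
{
    link_t *l = &pool[0];
    l->in_use = 1;
    l->credit = 0;
    return l;
}

static void link_free(link_t *l)
{
    l->in_use = 0;
    l->seq++;         /* invalidates every safe pointer captured earlier */
}

static safe_link_ptr_t safe_ptr_set(link_t *l)
{
    safe_link_ptr_t sp = { l, l->seq };
    return sp;
}

/* Returns the link, or NULL if it was freed (or freed and reused) since capture. */
static link_t *safe_ptr_deref(safe_link_ptr_t sp)
{
    if (sp.ptr && sp.ptr->in_use && sp.ptr->seq == sp.seq)
        return sp.ptr;
    return NULL;
}

/* Walks through the sequence described above: capture, close, reuse. */
void safe_ptr_demo(void)
{
    link_t *link = link_alloc();
    safe_link_ptr_t sp = safe_ptr_set(link);  /* IO side captures a reference */
    assert(safe_ptr_deref(sp) == link);       /* still valid */

    link_free(link);                          /* core side closes the link */
    assert(safe_ptr_deref(sp) == NULL);       /* stale: link was freed */

    link_t *reused = link_alloc();            /* slot handed to a new link */
    assert(reused == link);                   /* same address as before */
    assert(safe_ptr_deref(sp) == NULL);       /* still stale: generation differs */
}
```

With a check like this the IO thread can drop the work item when the
dereference returns NULL instead of touching freed memory. The sketch is
single-threaded; in the router the check and the subsequent use would also
need to be serialized against the free, for example under a lock shared with
the thread that deletes the link.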
> Seg fault in qdr_link_cleanup_CT after 12,400+ connections
> ----------------------------------------------------------
>
> Key: DISPATCH-1475
> URL: https://issues.apache.org/jira/browse/DISPATCH-1475
> Project: Qpid Dispatch
> Issue Type: Bug
> Components: Router Node
> Affects Versions: 1.9.0
> Environment: Two systems: Fedora 29
> Reporter: Charles E. Rolke
> Priority: Major
> Attachments: DISPATCH-1475-core-writeup.txt
>
>
> Running millions of messages on the network described in DISPATCH-1474. This
> morning's dispatch master Debug build, and proton 0.29.0 Debug build.
> While a stream of unsettled multicast messages is flowing, a separate
> process connects to EC1, receives a few messages, and then disconnects.
> Eventually the EC1 edge router seg faults with qdr_link_cleanup_CT receiving
> a conn=0x9999999999999999.
> This setup ran for hours before failing.
> For this command S_RECV is a softlink in my path for proton simple_receive.
> {{var=1; while true; do S_RECV -a $EC1_normal/multicast/q1 -m $var;
> var=$((var+1)); done}}
>