[ 
https://issues.apache.org/jira/browse/DISPATCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976254#comment-16976254
 ] 

Charles E. Rolke commented on DISPATCH-1475:
--------------------------------------------

After enough stack backtraces and log files, a pattern that explains this 
segfault is emerging.
 * The router network has four interior routers A-B-C-D in a linear 
arrangement. Each interior router has two edge routers attached.
 * The test has a long-lived stream of multicast messages between a sender on 
an edge router going in to interior A to a receiver on an edge router coming 
out of interior D.
 * Then a series of clients connect to one of the edge routers (EC1) attached 
to interior C (INTC), receive a small number of messages, and then close their 
connections.
 * Eventually EC1 segfaults.
 * The link that is failing is on the interrouter connection between EC1 and 
INTC; it is not a client link trying to receive a few messages from the data 
stream at EC1.

Logically, when one of the clients connects, EC1 opens a link to INTC to 
receive messages for the address. If more clients connect before the first 
disconnects, they all share the messages coming over the link to INTC. When, 
by chance, all the clients have disconnected at the same time, EC1 determines 
that the link to INTC is no longer necessary and closes it. This close is 
generated by edge address tracking and not by AMQP protocol events.

The problem is that this link closure is not properly sequenced with the 
relentless stream of messages arriving over the interrouter link. Sometimes the 
link is closed internally and deleted on one thread while link activity for it 
is still in flight on an IO thread. The IO thread then uses a link that has 
already been returned to the free pool (its memory reads as all 99s, as in the 
conn=0x9999999999999999 from the original report) and segfaults.

IO threads calling qdr_connection_process need a scheme like safe_pointer to 
detect that the underlying link has been deleted and to react accordingly.
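
To illustrate the kind of scheme meant here, below is a minimal sketch of a 
sequence-checked pointer over a pooled object. It is not the router's code: 
demo_link_t, demo_safe_ptr_t and all demo_* functions are hypothetical names 
invented for this example. The idea is that every free bumps a sequence number 
stored in the pooled slot, so a holder of an old reference can tell that the 
slot was freed (and possibly reused) before touching it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pooled link object; a stand-in, not the real qdr_link_t. */
typedef struct {
    uint32_t sequence;  /* bumped every time the slot is freed */
    bool     in_use;    /* true while the slot is allocated */
    int      credit;    /* stand-in for real link state */
} demo_link_t;

/* Hypothetical safe pointer: raw pointer plus the sequence seen at capture. */
typedef struct {
    demo_link_t *ptr;
    uint32_t     sequence;
} demo_safe_ptr_t;

#define DEMO_POOL_SIZE 4
static demo_link_t demo_pool[DEMO_POOL_SIZE];  /* zero-initialized free pool */

/* Take the first free slot from the pool, or NULL if the pool is exhausted. */
demo_link_t *demo_link_alloc(void)
{
    for (size_t i = 0; i < DEMO_POOL_SIZE; i++) {
        if (!demo_pool[i].in_use) {
            demo_pool[i].in_use = true;
            demo_pool[i].credit = 0;
            return &demo_pool[i];
        }
    }
    return NULL;
}

/* Return a slot to the pool and invalidate every outstanding safe pointer. */
void demo_link_free(demo_link_t *link)
{
    link->sequence++;
    link->in_use = false;
}

/* Capture a reference together with the slot's current sequence number. */
demo_safe_ptr_t demo_safe_ptr_set(demo_link_t *link)
{
    demo_safe_ptr_t sp = { link, link ? link->sequence : 0 };
    return sp;
}

/* Yield the link only if the slot has not been freed since capture. */
demo_link_t *demo_safe_ptr_deref(demo_safe_ptr_t sp)
{
    if (sp.ptr && sp.ptr->sequence == sp.sequence)
        return sp.ptr;
    return NULL;
}

/* What an IO-thread work item might do: skip the work if the link is gone. */
bool demo_process_link(demo_safe_ptr_t sp)
{
    demo_link_t *link = demo_safe_ptr_deref(sp);
    if (!link)
        return false;   /* link was deleted; react accordingly */
    link->credit++;
    return true;
}
```

The sketch is single-threaded. In the router the sequence check and the use of 
the link would also have to be serialized against the free, for example by 
doing both under a lock or on the same thread; otherwise the link could still 
be deleted between the check and the use.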

> Seg fault in qdr_link_cleanup_CT after 12,400+ connections
> ----------------------------------------------------------
>
>                 Key: DISPATCH-1475
>                 URL: https://issues.apache.org/jira/browse/DISPATCH-1475
>             Project: Qpid Dispatch
>          Issue Type: Bug
>          Components: Router Node
>    Affects Versions: 1.9.0
>         Environment: Two systems: Fedora 29
>            Reporter: Charles E. Rolke
>            Priority: Major
>         Attachments: DISPATCH-1475-core-writeup.txt
>
>
> Running millions of messages on network described in DISPATCH-1474. This 
> morning's dispatch master Debug build, and proton 0.29.0 Debug build.
> While a stream of unsettled multicast messages is flowing, a separate 
> process connects to EC1, receives a few messages, and then disconnects.
> Eventually the EC1 edge router seg faults with qdr_link_cleanup_CT receiving 
> a conn=0x9999999999999999.
> This setup ran for hours before failing.
> For this command S_RECV is a softlink in my path for proton simple_receive.
> {{var=1; while true; do S_RECV -a $EC1_normal/multicast/q1 -m $var; 
> var=$((var+1)); done}}
>  


