[ 
https://issues.apache.org/jira/browse/DISPATCH-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478735#comment-16478735
 ] 

Marcel Meulemans commented on DISPATCH-966:
-------------------------------------------

Sorry for the slow response, but I finally got around to doing some follow-up 
on this. It turns out the Python exceptions are a side effect, not the actual 
problem (however, I'll try to reproduce the stack trace later, if only to 
improve the Python response to the situation).

The actual problem is caused by this code (as far as I can see): 
[https://github.com/apache/qpid-dispatch/blob/master/src/message.c#L1168] ... 
In my situation with 10000 clients (each with two unique addresses), the MAU 
messages exchanged between routers can become quite large, so large that the 
limit set on the number of msg->content->buffers 
(qd_message_Q2_holdoff_should_block) is hit. This holdoff is unblocked when 
buffers are freed up by sending them out, but as the MAU message is not being 
sent out the holdoff is never unblocked. As a consequence all communication on 
this link comes to a halt (some messages still arrive on the link until the 
credit is used up, but are never processed by the router code) and eventually 
the network breaks down. It seems to me that this blocking should not occur on 
messages that are not going to be sent out. I verified my theory by increasing 
QD_QLIMIT_Q2_UPPER and observing that the problem goes away, but that is of 
course not a correct solution. I don't know enough about the router internals 
to propose a solution other than the qd_message_Q2_holdoff_should_block 
implementation 
([https://github.com/apache/qpid-dispatch/blob/master/src/message.c#L1950]) 
should probably also take into account that not all messages are sent out to 
other destinations.
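
To illustrate the theory above, here is a toy Python model of the holdoff. It 
is only a sketch, not the router's C code: the class, the function names and 
the threshold values (Q2_UPPER, Q2_LOWER) are assumptions made up for the 
example, and the release-below-a-lower-threshold rule is assumed as well. It 
only shows why a message that is never sent out cannot clear its own holdoff.

```python
# Toy model only: names and numbers are illustrative assumptions, not router code.
Q2_UPPER = 8  # hypothetical stand-in for QD_QLIMIT_Q2_UPPER
Q2_LOWER = 4  # hypothetical lower threshold at which the holdoff is released


class ToyMessage:
    def __init__(self, forwarded):
        self.forwarded = forwarded  # True if the message is sent on to another destination
        self.buffers = 0            # buffers currently held by the message content
        self.holdoff = False        # True while receiving is blocked

    def receive_buffer(self):
        """Try to accept one more buffer; return False if the holdoff blocks it."""
        if self.holdoff:
            return False
        self.buffers += 1
        if self.buffers >= Q2_UPPER:
            self.holdoff = True  # upper limit hit: stop receiving
        return True

    def send_step(self):
        """Free one buffer by sending it out; does nothing for locally consumed messages."""
        if self.forwarded and self.buffers > 0:
            self.buffers -= 1
            if self.holdoff and self.buffers < Q2_LOWER:
                self.holdoff = False  # enough buffers freed: resume receiving


def simulate(forwarded, total_buffers, max_steps=1000):
    """Return how many of total_buffers were received within max_steps."""
    msg = ToyMessage(forwarded)
    received = 0
    for _ in range(max_steps):
        if received == total_buffers:
            break
        if msg.receive_buffer():
            received += 1
        else:
            msg.send_step()  # blocked: the only way forward is to send buffers out
    return received
```

In this model, simulate(forwarded=True, total_buffers=20) receives all 20 
buffers, while simulate(forwarded=False, total_buffers=20) stalls at 8 (the 
assumed upper limit) and never recovers, which is the behaviour I believe the 
large MAU messages trigger.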

Btw, I have not been able to figure out how this leads to the initial error 
"Deliveries to a multicast address must be pre-settled". What I did notice is 
that proton trace logging is showing inconsistent settlement flags for messages 
that are split over multiple transfer frames (see 
[^inconsistent-settlement.log]).

> Qpid dispatch unstable inter-router connections
> -----------------------------------------------
>
>                 Key: DISPATCH-966
>                 URL: https://issues.apache.org/jira/browse/DISPATCH-966
>             Project: Qpid Dispatch
>          Issue Type: Bug
>          Components: Routing Engine
>    Affects Versions: 1.0.1
>            Reporter: Marcel Meulemans
>            Assignee: Ted Ross
>            Priority: Blocker
>             Fix For: 1.1.0
>
>         Attachments: inconsistent-settlement.log, 
> qdrouterd-unsettled-true.log, qdrouterd.conf, qdrouterd.log, 
> router-unsettled-true.dump, router.dump
>
>
> I am running a three node fully connected mesh of dispatch routers with 10000 
> attached clients and I am seeing some unstable inter-router connections (I am 
> sending around 1000 small, less than 1K, messages per second through the 
> network). The inter-router connections fail every so many seconds with the 
> message:
> {{Connection to router-2:55672 failed: amqp:session:invalid-field sequencing 
> error, expected delivery-id 7, got 6}}
> (the numbers 7 and 6 differ per connection loss)
> In Wireshark, using the attached tcpdump capture, I can see that every time 
> before the inter-router connection is dropped, there is a rejected 
> disposition with the message:
> {{Condition: qd:forbidden}}
> {{Description: Deliveries to a multicast address must be pre-settled}}
> The routers are connected as follows:
>  * router-0 -> router-1
>  * router-0 -> router-2
>  * router-1 -> router-2
> The routers are running as a docker container (debian stretch) on google 
> compute engine machines (every router on a separate node).
> Attached are:
>  * my qdrouter.conf (from one of the routers)
>  * a log snippet from router-0 at debug level from connection drop to 
> connection re-established to connection drop again.
>  * a tcpdump capture of the inter-router connection between router-0 and 
> router-1 during which several of the failures occur
> Versions:
>  * qpid-dispatch@1.0.1-rc1
>  * qpid-proton@0.20.0
>  
> [^qdrouterd.log]
> [^qdrouterd.conf]
> [^router.dump]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
For additional commands, e-mail: dev-h...@qpid.apache.org
