[
https://issues.apache.org/jira/browse/DISPATCH-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478735#comment-16478735
]
Marcel Meulemans commented on DISPATCH-966:
-------------------------------------------
Sorry for the slow response, but I finally got around to doing some follow
up on this. Turns out the python exceptions are a side effect, not the actual
problem (however I'll try to reproduce the stack trace later if only to improve
the python response to the situation).
The actual problem is caused by this code (as far as I can see):
[https://github.com/apache/qpid-dispatch/blob/master/src/message.c#L1168] ...
In my situation with 10000 clients (each with two unique addresses), the MAU
messages exchanged between routers can become quite large, so large that the
limit set on the number of msg->content->buffers
(qd_message_Q2_holdoff_should_block) is hit. This holdoff is unblocked when
buffers are freed up by sending them out, but as the MAU message is not being
sent out the holdoff is never unblocked. As a consequence all communication on
this link comes to a halt (some messages still arrive on the link until the
credit is used up, but are never processed by the router code) and eventually
the network breaks down. It seems to me that this blocking should not occur on
messages that are not going to be sent out. I verified my theory by increasing
QD_QLIMIT_Q2_UPPER and observing that the problem goes away, but that is of
course not a correct solution. I don't know enough about the router internals
to propose a solution, other than that the qd_message_Q2_holdoff_should_block
implementation
([https://github.com/apache/qpid-dispatch/blob/master/src/message.c#L1950])
should probably also take into account that not all messages are sent out to
other destinations.
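To make the stall easier to reason about, here is a minimal sketch of the mechanism as I understand it. This is not the router's code: the `sim_` names, the struct, and the threshold values are all hypothetical and only model the described behaviour (input is held off once a message holds too many buffers, and is released only when buffers are freed by sending them out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical thresholds, for illustration only (not the router's values). */
#define SIM_Q2_UPPER ((size_t)8)
#define SIM_Q2_LOWER ((size_t)4)

/* Hypothetical model of one incoming message on a link. */
typedef struct {
    size_t buffered;      /* buffers currently held by the message */
    bool   has_consumer;  /* true if the message is forwarded to an outgoing link */
    bool   holdoff;       /* true while input on the link is held off */
} sim_message_t;

/* Accept one incoming buffer; returns false while input is held off. */
static bool sim_receive(sim_message_t *m)
{
    if (m->holdoff)
        return false;
    assert(m->buffered < SIM_Q2_UPPER);
    m->buffered++;
    if (m->buffered >= SIM_Q2_UPPER)
        m->holdoff = true;            /* upper limit hit: block further input */
    return true;
}

/* Send one buffer out. Buffers are only freed when there is a consumer. */
static void sim_send(sim_message_t *m)
{
    if (!m->has_consumer || m->buffered == 0)
        return;                       /* nothing is sent, so nothing is freed */
    m->buffered--;
    if (m->holdoff && m->buffered < SIM_Q2_LOWER)
        m->holdoff = false;           /* enough buffers freed: unblock input */
}

/* Try to receive `total` buffers, sending whenever input is held off.
 * Returns how many buffers were accepted within a bounded number of steps. */
size_t sim_run(bool has_consumer, size_t total)
{
    sim_message_t m = { 0, has_consumer, false };
    size_t accepted = 0;
    for (size_t step = 0; step < 10 * total && accepted < total; step++) {
        if (sim_receive(&m))
            accepted++;
        else
            sim_send(&m);
    }
    return accepted;
}
```

In this model `sim_run(true, 100)` accepts all 100 buffers, because sending keeps freeing space, while `sim_run(false, 100)` stops at 8 and never recovers. The second case corresponds to a large MAU message that is consumed locally rather than forwarded.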
Btw, I have not been able to figure out how this leads to the initial error
"Deliveries to a multicast address must be pre-settled". What I did notice is
that proton trace logging is showing inconsistent settlement flags for messages
that are split over multiple transfer frames (see
[^inconsistent-settlement.log]).
> Qpid dispatch unstable inter-router connections
> -----------------------------------------------
>
> Key: DISPATCH-966
> URL: https://issues.apache.org/jira/browse/DISPATCH-966
> Project: Qpid Dispatch
> Issue Type: Bug
> Components: Routing Engine
> Affects Versions: 1.0.1
> Reporter: Marcel Meulemans
> Assignee: Ted Ross
> Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: inconsistent-settlement.log,
> qdrouterd-unsettled-true.log, qdrouterd.conf, qdrouterd.log,
> router-unsettled-true.dump, router.dump
>
>
> I am running a three node fully connected mesh of dispatch routers with 10000
> attached clients and I am seeing some unstable inter-router connections (I am
> sending around 1000 small, less than 1K, messages per second through the
> network). The inter-router connections fail every so many seconds with the
> message:
> {{Connection to router-2:55672 failed: amqp:session:invalid-field sequencing
> error, expected delivery-id 7, got 6}}
> (the numbers 7 and 6 differ per connection loss)
> In wireshark, using the attached tcpdump capture, I can see that every time
> before the inter-router connection is dropped, there is a rejected
> disposition with the message:
> {{Condition: qd:forbidden}}
> {{Description: Deliveries to a multicast address must be pre-settled}}
> The routers are connected as follows:
> * router-0 -> router-1
> * router-0 -> router-2
> * router-1 -> router-2
> The routers are running as a docker container (debian stretch) on google
> compute engine machines (every router on a separate node).
> Attached are:
> * my qdrouterd.conf (from one of the routers)
> * a log snippet from router-0 at debug level from connection drop to
> connection re-established to connection drop again.
> * a tcpdump capture of the inter-router connection between router-0 and
> router-1 during which several of the failures occur
> Versions:
> * [email protected]
> * [email protected]
>
> [^qdrouterd.log]
> [^qdrouterd.conf]
> [^router.dump]