[
https://issues.apache.org/jira/browse/DISPATCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Giusti resolved DISPATCH-1256.
----------------------------------
Fix Version/s: 1.9.0
Resolution: Fixed
We've identified that the bulk of the inter-router bottleneck is due to the
heavy processing of Message Annotations on the inter-router links.
Release 1.9.0 contains a number of optimizations that improve inter-router
forwarding; however, further improvements to the MA processing codepath can
still be made.
That work is being tracked in DISPATCH-1389.
> Throughput gets worse as more routers are added
> -----------------------------------------------
>
> Key: DISPATCH-1256
> URL: https://issues.apache.org/jira/browse/DISPATCH-1256
> Project: Qpid Dispatch
> Issue Type: Bug
> Reporter: Alan Conway
> Assignee: Ted Ross
> Priority: Major
> Fix For: 1.9.0
>
> Attachments: reproducer.tgz
>
>
> Creating a chain of routers gives worse throughput as the chain gets longer.
> Reproducer:
> Install quiver from here: https://github.com/alanconway/quiver (extended
> with --send-url)
> Run the attached script like this:
> qdr-chain.sh quiver --arrow c --count 1000 --send-url //:10001/testme //:10004/testme
> This sets up a chain of 4 routers, each connected to the previous one, and
> waits for qdstat -nv at router 1 to show 4 routers in the topology. It then
> receives messages from address 'testme' at router 4 and sends to address
> 'testme' at router 1.
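> For reference, a sketch of what router 2's qdrouterd.conf might look like in
> such a chain. The router id and the inter-router port numbers are
> illustrative assumptions; the attached script generates the actual configs:
>
>     router {
>         mode: interior
>         id: Router.2
>     }
>     listener {              # client traffic (quiver attaches here)
>         host: 0.0.0.0
>         port: 10002
>         role: normal
>     }
>     listener {              # accept the next router in the chain
>         host: 0.0.0.0
>         port: 20002         # assumed inter-router port
>         role: inter-router
>     }
>     connector {             # join the chain: connect to the previous router
>         host: localhost
>         port: 20001         # assumed inter-router port on router 1
>         role: inter-router
>     }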
> This runs the routers with enough tracing to run the dispatch scraper tool,
> which may affect the results. However, 1-million-message runs without
> tracing show a drop-off in throughput going from 1m0 (no routers) up to 1m4
> (a 4-router chain): the more routers, the worse it gets. This seems
> consistent with the credit delays seen in the traced case.
> 1m0/out.txt:End-to-end rate ..................................... 296,824 messages/s
> 1m0/out.txt: 50% ........ 2 ms 99.90% ........ 6 ms
> 1m1/out.txt:End-to-end rate ..................................... 133,851 messages/s
> 1m1/out.txt: 50% ........ 1 ms 99.90% ........ 3 ms
> 1m2/out.txt:End-to-end rate ...................................... 59,011 messages/s
> 1m2/out.txt: 50% ........ 2 ms 99.90% ........ 8 ms
> 1m3/out.txt:End-to-end rate ...................................... 42,146 messages/s
> 1m3/out.txt: 50% ........ 3 ms 99.90% ....... 11 ms
> 1m4/out.txt:End-to-end rate ...................................... 32,415 messages/s
> 1m4/out.txt: 50% ........ 4 ms 99.90% ....... 13 ms
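> To put numbers on the drop-off, a quick throwaway calculation over the
> rates above (plain Python, not part of the reproducer):
>
>     # End-to-end rates from the 1m0..1m4 runs above (messages/s).
>     rates = [296824, 133851, 59011, 42146, 32415]
>
>     # Cost of each added router: ratio of the previous rate to the new one.
>     for hops, (prev, cur) in enumerate(zip(rates, rates[1:]), start=1):
>         print(f"{hops} router(s): {cur:6d} msg/s, {prev / cur:.2f}x slower")
>
> Each of the first two routers roughly halves the throughput (2.22x, 2.27x);
> later routers cost less per hop (1.40x, 1.30x) but throughput keeps falling.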
> The output (config, logs and scrapes) from a 1000-message, 4-router tracing
> run is included, as well as the script to generate new data. Runs of 10k
> messages are feasible, but the HTML gets large and slow to view; from my
> experiments, the results are similar.
> From file:scrape.html#c_connections, the sender connection is starved of
> credit far more than any other connection: 20x more starvation instances,
> 40x longer total starvation, and a 100x longer wait for initial credit.
> Overall, the flow of credit is slower than the flow of messages, and each
> additional router reduces credit throughput, which in turn reduces message
> throughput.
> Observations:
> - initial credit delay is excessive (100ms in this case, when worst-case
> message latency is 33ms); something is wrong there.
> - inter-router link credit is small (250) compared to application credit
> (1000). Giving full credit to a sender creates a credit sharing problem,
> but we could give full credit to the inter-router links. (A sketch of how
> the application-side window is granted follows this list.)
> - re-evaluate our credit sharing strategy to perform better in the common
> unshared case.
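> The application credit above (the 1000 in the receiver's flow frames) is
> simply the window the receiving client grants, as quiver's receiver does
> internally. As a reference point, here is a minimal qpid-proton Python
> receiver that manages that window explicitly; a sketch only, with the host,
> port and address chosen to match this reproducer:
>
>     from proton.handlers import MessagingHandler
>     from proton.reactor import Container
>
>     class CreditedReceiver(MessagingHandler):
>         """Receive from an address, managing the credit window by hand."""
>
>         def __init__(self, url, address, window=1000):
>             # prefetch=0 disables the handler's automatic credit top-up,
>             # so the window below is exactly what the peer sees in flow.
>             super().__init__(prefetch=0)
>             self.url, self.address, self.window = url, address, window
>
>         def on_start(self, event):
>             conn = event.container.connect(self.url)
>             self.receiver = event.container.create_receiver(conn, self.address)
>
>         def on_link_opened(self, event):
>             if event.receiver is self.receiver:
>                 # Grant the full window up front: this is the (0,1000)
>                 # in the quiver-receiver flow frames below.
>                 self.receiver.flow(self.window)
>
>         def on_message(self, event):
>             # Re-issue credit one-for-one to hold the window steady.
>             self.receiver.flow(1)
>
>     Container(CreditedReceiver("localhost:10004", "testme")).run()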
> Extracts from scrape (in each flow frame, the parenthesized pair is
> (delivery-count, credit)):
> ◊ 2019-01-23 13:55:16.524477 D0#141 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (0,1000)
> ◊ 2019-01-23 13:55:16.633700 A0#287 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (0,250)
> Initial delay 109ms, much worse than the worst-case message latency of 33ms.
> ◊ 2019-01-23 13:55:16.639758 D0#153 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (3,1000)
> ◊ 2019-01-23 13:55:16.642682 A0#543 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (168,85)
> ◊ 2019-01-23 13:55:16.645205 D0#163 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (5,1000)
> ◊ 2019-01-23 13:55:16.645448 D0#167 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (6,1000)
> ◊ 2019-01-23 13:55:16.646510 A0#713 1 [A0_6] <- [] quiver-sender-3db6ad3f transfer [0,0] (252) transfer_281 - 158 bytes no more credit (accepted settled 0.028323 S)
> ◊ 2019-01-23 13:55:16.647142 D0#193 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (7,1000)
> ◊ 2019-01-23 13:55:16.647198 D0#195 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (10,1000)
> ◊ 2019-01-23 13:55:16.647915 D0#214 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (16,1000)
> ◊ 2019-01-23 13:55:16.648060 D0#217 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (25,1000)
> ◊ 2019-01-23 13:55:16.651294 D0#239 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (28,1000)
> ◊ 2019-01-23 13:55:16.652235 D0#257 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (41,1000)
> ◊ 2019-01-23 13:55:16.654294 A0#802 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (253,9) credit restored
> Here we see that the receiver issued 9 flow (credit) frames in the time the
> front router issued only 2. Either the router is slow to forward messages,
> or it is not giving credit back quickly when it does (or both). See flow.txt
> for the complete history of flow events at the sender and receiver, and
> scrape.html for the complete history of everything.
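> To quantify that imbalance when repeating this analysis, counting flow
> frames per endpoint in flow.txt is enough. A throwaway sketch, assuming
> flow.txt lines match the format of the extracts above (the regex and file
> name are guesses; adjust to the real layout):
>
>     import re
>     from collections import defaultdict
>
>     # Matches lines like:
>     #   "2019-01-23 13:55:16.524477 ... quiver-receiver-2ebb30e5 flow [0,0] (41,1000)"
>     FLOW_RE = re.compile(
>         r"\d{4}-\d{2}-\d{2} (\d{2}):(\d{2}):(\d{2}\.\d+)"  # timestamp
>         r".*?(quiver-\S+)"                                 # endpoint name
>         r" flow \[\d+,\d+\] \(\d+,(\d+)\)"                 # (dlv-count,credit)
>     )
>
>     frames = defaultdict(list)  # endpoint -> [(seconds, credit), ...]
>     with open("flow.txt") as f:
>         for line in f:
>             m = FLOW_RE.search(line)
>             if m:
>                 h, mnt, sec, endpoint, credit = m.groups()
>                 t = int(h) * 3600 + int(mnt) * 60 + float(sec)
>                 frames[endpoint].append((t, int(credit)))
>
>     for endpoint, events in frames.items():
>         span = events[-1][0] - events[0][0]
>         rate = (len(events) - 1) / span if span > 0 else float("nan")
>         print(f"{endpoint}: {len(events)} flow frames over {span:.3f}s "
>               f"({rate:.1f} frames/s)")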