[
https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634301#comment-16634301
]
Jason Brown commented on CASSANDRA-14747:
-----------------------------------------
[~jolynch] Nice work. I agree the time bounding of dequeueMessages is somewhat
questionable - I added it in when we were making a bunch of other changes for
dealing with CPU/task starvation.
In your gist, I think we can run into some serious overscheduling
(re-enqueueing of the consumer task) when the channel is unwritable. In that
case, dequeueMessages breaks out of its while loop immediately, but then
immediately reschedules itself (assuming backlog > 0). We'll keep doing this,
very aggressively, until the channel becomes writable again, yet we cannot make
any meaningful progress in the meantime. That is why I had dequeueMessages not
reschedule, and instead had handleMessageResult reschedule: at that point
(remember, we only attach the listener to the last message of the bunch) we
know the bytes have been written to the socket and that the channel should be
writable again. That way we only schedule (or directly execute)
dequeueMessages when we need to. (Note: this is probably not apparent from the
current code's comments, so I should definitely improve them.)
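To make the difference concrete, here is a rough, single-threaded toy model of
the two rescheduling strategies. It is only a sketch: it does not use Netty or
the real Cassandra classes. The class name, the HIGH_WATER_MARK constant, the
tick-based "event loop" and the simulate() helper are all made up for
illustration; only the names dequeueMessages and handleMessageResult mirror the
methods discussed above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: every name here is hypothetical, not Cassandra's or Netty's API.
final class ReschedulingSketch {
    // Assumption: the channel turns unwritable once this many messages are unflushed.
    static final int HIGH_WATER_MARK = 2;

    final Deque<Integer> backlog = new ArrayDeque<>();     // queued outbound messages
    final Deque<Runnable> eventLoop = new ArrayDeque<>();  // stand-in for the event loop's task queue
    final boolean rescheduleFromDequeue;                   // true = the gist's behaviour
    int unflushed;   // messages written to the channel but not yet on the socket
    int delivered;   // messages flushed to the socket
    int wastedRuns;  // consumer runs that had a backlog but could not write anything

    ReschedulingSketch(boolean rescheduleFromDequeue) {
        this.rescheduleFromDequeue = rescheduleFromDequeue;
    }

    boolean isWritable() {
        return unflushed < HIGH_WATER_MARK;
    }

    // Consumer task: write while the channel is writable, then stop.
    void dequeueMessages() {
        boolean wrote = false;
        while (!backlog.isEmpty() && isWritable()) {
            backlog.poll();
            unflushed++;
            wrote = true;
        }
        if (!wrote && !backlog.isEmpty())
            wastedRuns++;
        // Gist-style: re-enqueue whenever a backlog remains, writable or not.
        if (rescheduleFromDequeue && !backlog.isEmpty())
            eventLoop.add(this::dequeueMessages);
    }

    // Models the listener on the last message of a batch: the bytes have reached
    // the socket, so the channel should be writable again.
    void handleMessageResult() {
        delivered += unflushed;
        unflushed = 0;
        if (!rescheduleFromDequeue && !backlog.isEmpty())
            eventLoop.add(this::dequeueMessages);
    }

    // Runs one task per tick; the "socket" drains every ticksPerFlush ticks.
    static int simulate(boolean rescheduleFromDequeue, int messages, int ticksPerFlush) {
        ReschedulingSketch s = new ReschedulingSketch(rescheduleFromDequeue);
        for (int i = 0; i < messages; i++)
            s.backlog.add(i);
        s.eventLoop.add(s::dequeueMessages);
        for (int tick = 1; s.delivered < messages; tick++) {
            Runnable task = s.eventLoop.poll();
            if (task != null)
                task.run();
            if (tick % ticksPerFlush == 0 && s.unflushed > 0)
                s.handleMessageResult();
        }
        return s.wastedRuns;
    }

    public static void main(String[] args) {
        System.out.println("wasted runs, reschedule from dequeueMessages:     " + simulate(true, 4, 5));
        System.out.println("wasted runs, reschedule from handleMessageResult: " + simulate(false, 4, 5));
    }
}
```

In this model, rescheduling from dequeueMessages runs the consumer on every
tick while the channel is unwritable, whereas rescheduling from
handleMessageResult runs it only after a flush. The real event loop is of
course not tick-driven, so the counts only illustrate the shape of the problem.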
> Evaluate 200 node, compression=none, encryption=none, coalescing=off
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-14747
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
> Project: Cassandra
> Issue Type: Sub-task
> Reporter: Joseph Lynch
> Assignee: Joseph Lynch
> Priority: Major
> Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png,
> 4.0.11-after-jolynch-tweaks.svg, 4.0.7-before-my-changes.svg,
> 4.0_errors_showing_heap_pressure.txt,
> 4.0_heap_histogram_showing_many_MessageOuts.txt,
> i-0ed2acd2dfacab7c1-after-looping-fixes.svg,
> ttop_NettyOutbound-Thread_spinning.txt,
> useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg,
> useast1e-i-08635fa1631601538_flamegraph_96node.svg,
> useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes,
> useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no
> compression, no encryption, no coalescing).