[
https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619974#comment-16619974
]
Joseph Lynch edited comment on CASSANDRA-14747 at 9/19/18 2:08 AM:
-------------------------------------------------------------------
Things went much better today: after the queue fixes we no longer ran out of
memory, but the {{MessagingService-NettyOutbound-Thread}} threads were pinned at
100% CPU. We (Jason, Jordan, myself, etc.) tracked it down to various
unfortunate looping behaviors in the {{OutboundMessagingConnection}} class.
We're following up with various fixes to these queueing problems. I've attached
flame graphs and ttop outputs showing what's happening on the latest version of
the {{jasobrown/14503-collab}} branch.
We think a few things are going on here:
# When the outbound queues get backed up we enter various long (sometimes
infinite) loops. We're working on stopping those.
# Since we're multiplexing multiple nodes onto one outbound thread, we can
have multi-tenant queues where one slow consumer hurts other nodes as well.
We're working on a fix for this.
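To illustrate the two points above, here is a minimal sketch of one way a single outbound thread could serve several peers without spinning or letting one slow peer block the rest. It is not the code on the {{jasobrown/14503-collab}} branch and not necessarily the fix being pursued; the class {{FairOutboundLoop}}, its methods, the {{maxPerTurn}} limit and the {{writable}} predicate are all hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Predicate;

/** Hypothetical sketch: one outbound thread serving several peers fairly. */
final class FairOutboundLoop {
    // One queue per peer, so a backlog for one peer never sits in front of
    // messages destined for another peer.
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();
    // Upper bound on messages sent to one peer in a single pass (assumption).
    private final int maxPerTurn;

    FairOutboundLoop(int maxPerTurn) {
        if (maxPerTurn <= 0) {
            throw new IllegalArgumentException("maxPerTurn must be positive");
        }
        this.maxPerTurn = maxPerTurn;
    }

    void enqueue(String peer, String message) {
        queues.computeIfAbsent(peer, p -> new ArrayDeque<>()).add(message);
    }

    /**
     * Makes one pass over all peers. Peers that are not writable are skipped
     * rather than retried, and each writable peer gets at most maxPerTurn
     * messages, so the pass always terminates.
     */
    List<String> drainOnce(Predicate<String> writable) {
        List<String> sent = new ArrayList<>();
        for (Map.Entry<String, Queue<String>> entry : queues.entrySet()) {
            if (!writable.test(entry.getKey())) {
                continue; // slow peer: leave its backlog queued for a later pass
            }
            Queue<String> queue = entry.getValue();
            for (int i = 0; i < maxPerTurn && !queue.isEmpty(); i++) {
                sent.add(entry.getKey() + ":" + queue.poll());
            }
        }
        return sent;
    }

    int pending(String peer) {
        Queue<String> queue = queues.get(peer);
        return queue == null ? 0 : queue.size();
    }
}
```

In this sketch a backed-up peer keeps its messages queued while the other peers still make progress, and the bounded inner loop means a pass cannot run indefinitely. A real implementation would also need to bound or expire the skipped peer's backlog, which the sketch leaves out.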
> Evaluate 200 node, compression=none, encryption=none, coalescing=off
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-14747
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
> Project: Cassandra
> Issue Type: Sub-task
> Reporter: Joseph Lynch
> Assignee: Joseph Lynch
> Priority: Major
> Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png,
> 4.0_errors_showing_heap_pressure.txt,
> 4.0_heap_histogram_showing_many_MessageOuts.txt,
> i-0ed2acd2dfacab7c1-after-looping-fixes.svg,
> ttop_NettyOutbound-Thread_spinning.txt,
> useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no
> compression, no encryption, no coalescing).