[
https://issues.apache.org/jira/browse/CASSANDRA-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304836#comment-17304836
]
Benjamin Lerer commented on CASSANDRA-14764:
--------------------------------------------
+1 I will close the ticket.
> Test Messaging Refactor with: 12 Node Breaking Point, compression=none,
> encryption=none, coalescing=off
> -------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-14764
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14764
> Project: Cassandra
> Issue Type: Sub-task
> Components: Legacy/Streaming and Messaging
> Reporter: Joey Lynch
> Assignee: Vinay Chella
> Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: i-03341e1c52de6ea3e-after-queue-change.svg,
> i-07cd92e844d66d801-after-queue-bound.svg, i-07cd92e844d66d801-hint-play.svg,
> i-07cd92e844d66d801-uninlined-with-jvm-methods.svg, ttop.txt
>
>
> *Setup:*
> * Cassandra: 12 (2*6) node i3.xlarge AWS instances (4 CPU cores, 30GB RAM)
> running Cassandra trunk off of jasobrown/14503 jdd7ec5a2 (Jason's patched
> internode messaging branch) vs the same footprint running 3.0.17
> * Two datacenters with 100ms latency between them
> * No compression, encryption, or coalescing turned on
> *Test #1:*
> ndbench sent 1.5k QPS at a coordinator level to one datacenter (RF=3*2 = 6 so
> 3k global replica QPS) of 4KB single-partition BATCH mutations at LOCAL_ONE.
> This represents about 250 QPS per coordinator in the first datacenter or 60
> QPS per core. The goal was to observe P99 write and read latencies at
> various QPS levels.
> *Result:*
> The good news is that since the CASSANDRA-14503 changes, instead of keeping
> the mutations on heap we put the messages into hints and don't run out of
> memory. The bad news is that the {{MessagingService-NettyOutbound-Thread}}
> threads would occasionally enter a degraded state where they would just spin
> on a core. I've attached flame graphs showing the CPU state as [~jasobrown]
> applied fixes to the {{OutboundMessagingConnection}} class.
> *Follow Ups:*
> [~jasobrown] has committed a number of fixes onto his
> {{jasobrown/14503-collab}} branch including:
> 1. Limiting the amount of time spent dequeuing messages if they are expired
> (previously, if messages entered the queue faster than we could dequeue them,
> we'd just loop infinitely on the consumer side)
> 2. Not calling {{dequeueMessages}} from within callbacks created by
> {{dequeueMessages}}.
> We're continuing to use CPU flame graphs to figure out where we're looping and
> fixing bugs as we find them.
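
For anyone reading the ticket later, the two follow-up fixes listed above can be sketched roughly as below. This is a minimal illustration and not the code on the {{jasobrown/14503-collab}} branch: the class {{BoundedDequeueSketch}}, its {{Message}} type, the injected clock, the {{onExpired}} callback, and the budget parameter are all hypothetical names, and a single consumer thread is assumed. Only the method name {{dequeueMessages}} comes from the description.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical sketch; NOT the actual OutboundMessagingConnection code. */
final class BoundedDequeueSketch {
    /** A queued message with an absolute expiry time in nanoseconds (assumed shape). */
    static final class Message {
        final String payload;
        final long expiresAtNanos;

        Message(String payload, long expiresAtNanos) {
            this.payload = payload;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ArrayDeque<Message> queue = new ArrayDeque<>();
    private final LongSupplier clock;   // injectable so time can be controlled in tests
    private final long budgetNanos;     // maximum time one dequeue pass may run
    private boolean dequeuing = false;  // reentrancy guard; single consumer thread assumed

    BoundedDequeueSketch(LongSupplier clock, long budgetNanos) {
        this.clock = clock;
        this.budgetNanos = budgetNanos;
    }

    void enqueue(Message m) {
        queue.add(m);
    }

    int pending() {
        return queue.size();
    }

    /**
     * Polls messages until the queue is empty or the time budget is spent.
     * Expired messages are dropped and reported through onExpired; live
     * messages are returned in order. A nested call made while a pass is
     * already running returns an empty list without touching the queue.
     */
    List<Message> dequeueMessages(Runnable onExpired) {
        List<Message> live = new ArrayList<>();
        if (dequeuing) {
            return live; // fix 2: never re-enter from a callback of this method
        }
        dequeuing = true;
        try {
            long start = clock.getAsLong();
            Message m;
            while ((m = queue.poll()) != null) {
                long now = clock.getAsLong();
                // Subtraction keeps the comparisons safe for System.nanoTime values.
                if (m.expiresAtNanos - now <= 0) {
                    onExpired.run();
                } else {
                    live.add(m);
                }
                if (now - start >= budgetNanos) {
                    break; // fix 1: bounded pass, even if producers outpace us
                }
            }
        } finally {
            dequeuing = false;
        }
        return live;
    }
}
```

In real use the clock would presumably be {{System::nanoTime}} and the budget something on the order of a millisecond; whatever remains in the queue after the budget is spent would be picked up by the next scheduled pass rather than by spinning in the current one.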