[jira] [Commented] (CASSANDRA-14764) Test Messaging Refactor with: 12 Node Breaking Point, compression=none, encryption=none, coalescing=off

Benjamin Lerer (Jira) Fri, 19 Mar 2021 04:09:07 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304836#comment-17304836
 ]


Benjamin Lerer commented on CASSANDRA-14764:
--------------------------------------------

+1 I will close the ticket.

> Test Messaging Refactor with: 12 Node Breaking Point, compression=none, 
> encryption=none, coalescing=off
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14764
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14764
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Legacy/Streaming and Messaging
>            Reporter: Joey Lynch
>            Assignee: Vinay Chella
>            Priority: Normal
>             Fix For: 4.0-rc
>
>         Attachments: i-03341e1c52de6ea3e-after-queue-change.svg, 
> i-07cd92e844d66d801-after-queue-bound.svg, i-07cd92e844d66d801-hint-play.svg, 
> i-07cd92e844d66d801-uninlined-with-jvm-methods.svg, ttop.txt
>
>
> *Setup:*
>  * Cassandra: 12 (2*6) node i3.xlarge AWS instance (4 CPU cores, 30GB RAM) 
> running Cassandra trunk off of jasobrown/14503 jdd7ec5a2 (Jason's patched 
> internode messaging branch) vs the same footprint running 3.0.17
>  * Two datacenters with 100ms latency between them
>  * No compression, encryption, or coalescing turned on
> *Test #1:*
> ndbench sent 1.5k QPS at a coordinator level to one datacenter (RF=3*2 = 6 so 
> 3k global replica QPS) of 4kb single partition BATCH mutations at LOCAL_ONE. 
> This represents about 250 QPS per coordinator in the first datacenter or 60 
> QPS per core. The goal was to observe P99 write and read latencies under 
> various QPS.
> *Result:*
> The good news is that since the CASSANDRA-14503 changes, instead of keeping 
> the mutations on heap we put the messages into hints and don't run out of 
> memory. The bad news is that the {{MessagingService-NettyOutbound-Thread}} 
> threads would occasionally enter a degraded state where they would just spin 
> on a core. I've attached flame graphs showing the CPU state as [~jasobrown] 
> applied fixes to the {{OutboundMessagingConnection}} class.
> *Follow Ups:*
> [~jasobrown] has committed a number of fixes onto his 
> {{jasobrown/14503-collab}} branch including:
> 1. Limiting the amount of time spent dequeuing messages if they are expired 
> (previously, if messages entered the queue faster than we could dequeue them, 
> we'd just loop infinitely on the consumer side)
> 2. Not calling {{dequeueMessages}} from within callbacks created by 
> {{dequeueMessages}}.
> We're continuing to use CPU flamegraphs to figure out where we're looping and 
> fixing bugs as we find them.
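
For readers of the archive, the first follow-up quoted above (bounding the time spent dropping expired messages so the consumer cannot spin forever) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the jasobrown/14503-collab branch: the class BoundedDequeueSketch, the Message type, the dropExpired method, and the injected clock are all hypothetical names chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.LongSupplier;

// Hypothetical sketch: none of these names come from the Cassandra code base.
class BoundedDequeueSketch {

    /** Hypothetical message: a payload plus an absolute expiry time in nanoseconds. */
    static final class Message {
        final String payload;
        final long expiresAtNanos;

        Message(String payload, long expiresAtNanos) {
            this.payload = payload;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    /**
     * Drops expired messages from the head of the queue, but stops once the
     * time budget is used up, even if more expired messages remain.
     * The clock is injected so the behaviour can be tested deterministically.
     *
     * @return the number of expired messages dropped
     */
    static int dropExpired(Queue<Message> queue, LongSupplier clock, long budgetNanos) {
        long deadline = clock.getAsLong() + budgetNanos;
        int dropped = 0;
        while (true) {
            Message head = queue.peek();
            if (head == null) {
                break; // queue is empty
            }
            long now = clock.getAsLong();
            if (head.expiresAtNanos - now > 0) {
                break; // head is still live; leave it for the normal send path
            }
            if (now - deadline >= 0) {
                break; // budget exhausted; yield instead of looping indefinitely
            }
            queue.poll();
            dropped++;
        }
        return dropped;
    }

    public static void main(String[] args) {
        Queue<Message> queue = new ArrayDeque<>();
        long now = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            queue.add(new Message("m" + i, now - 1)); // already expired
        }
        // Spend at most roughly one millisecond dropping expired messages.
        int dropped = dropExpired(queue, System::nanoTime, 1_000_000L);
        System.out.println("dropped=" + dropped + " remaining=" + queue.size());
    }
}
```

The point of the budget check is that a producer that enqueues expired messages faster than the consumer can drop them no longer pins the consumer thread; whatever is left over is handled on a later pass.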



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
