[
https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Semb Wever updated CASSANDRA-15430:
-------------------------------------------
Attachment: jfr_jmc_3-11_obj.png
> Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations
> compared to 2.1.18
> ----------------------------------------------------------------------------------------
>
> Key: CASSANDRA-15430
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15430
> Project: Cassandra
> Issue Type: Bug
> Reporter: Thomas Steinmaurer
> Priority: Normal
> Attachments: dashboard.png, jfr_allocations.png, jfr_jmc_2-1.png,
> jfr_jmc_2-1_obj.png, jfr_jmc_3-0.png, jfr_jmc_3-0_obj.png, jfr_jmc_3-11.png,
> jfr_jmc_3-11_obj.png, jfr_jmc_4-0-b2.png, jfr_jmc_4-0-b2_obj.png,
> mutation_stage.png, screenshot-1.png, screenshot-2.png, screenshot-3.png,
> screenshot-4.png
>
>
> In a 6-node loadtest cluster, we have been constantly running a certain
> production-like workload on 2.1.18 without problems. After upgrading one node
> to 3.0.18 (the remaining 5 are still on 2.1.18 after we saw the sort of
> regression described below), 3.0.18 is showing increased CPU usage, increased
> GC, high mutation stage pending tasks, dropped mutation messages ...
> Some specs; all 6 nodes are equally sized:
> * Bare metal, 32 physical cores, 512G RAM
> * Xmx31G, G1, max pause millis = 2000ms (see the JVM settings sketch after
> this list)
> * cassandra.yaml basically unchanged, thus the same settings with regard to
> number of threads, compaction throttling etc.
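> For context, the JVM settings bullet above corresponds to flags roughly like
> the following (a sketch only; the exact flag spelling and where they live,
> e.g. cassandra-env.sh or jvm.options, is an assumption rather than a copy
> from our nodes):
> {noformat}
> -Xmx31G
> -XX:+UseG1GC
> -XX:MaxGCPauseMillis=2000
> {noformat}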
> The following dashboard shows highlighted areas (CPU, suspension) with
> metrics for all 6 nodes; the one outlier is the node upgraded to Cassandra
> 3.0.18.
> !dashboard.png|width=1280!
> Additionally, we see a large increase in pending tasks in the mutation stage
> after the upgrade:
> !mutation_stage.png!
> And dropped mutation messages, also confirmed in the Cassandra log:
> {noformat}
> INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout
> INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool Name                Active   Pending    Completed   Blocked   All Time Blocked
> INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage               256     81824   3360532756         0                  0
> INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ViewMutationStage             0         0            0         0                  0
> INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage                     0         0     62862266         0                  0
> INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - RequestResponseStage          0         0   2176659856         0                  0
> INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - ReadRepairStage               0         0            0         0                  0
> INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - CounterMutationStage          0         0            0         0                  0
> ...
> {noformat}
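> For anyone who wants to watch this buildup live rather than waiting for the
> StatusLogger, below is a minimal sketch of polling the MutationStage
> pending-task count over JMX. The localhost:7199 endpoint and the exact metric
> ObjectName are assumptions (the usual Cassandra defaults as we understand
> them), so adjust them to the node being watched:
> {code:java}
> import javax.management.MBeanServerConnection;
> import javax.management.ObjectName;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
>
> public class MutationStageWatcher {
>     public static void main(String[] args) throws Exception {
>         // Assumed JMX endpoint; Cassandra exposes JMX on port 7199 by default.
>         JMXServiceURL url = new JMXServiceURL(
>                 "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
>         try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
>             MBeanServerConnection mbs = connector.getMBeanServerConnection();
>             // Thread pool metric name as exposed by the Cassandra metrics
>             // library (assumed, not copied from a running node).
>             ObjectName pending = new ObjectName(
>                     "org.apache.cassandra.metrics:type=ThreadPools,path=request,"
>                     + "scope=MutationStage,name=PendingTasks");
>             // Poll every 5 seconds, matching the StatusLogger interval above.
>             for (int i = 0; i < 12; i++) {
>                 Object value = mbs.getAttribute(pending, "Value");
>                 System.out.println("MutationStage pending tasks: " + value);
>                 Thread.sleep(5000);
>             }
>         }
>     }
> }
> {code}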
> Judging from a 15 min JFR session for both 3.0.18 and 2.1.18 (the latter on a
> different node), at a high level it looks like the code path underneath
> {{BatchMessage.execute}} is producing ~10x more on-heap allocations in 3.0.18
> compared to 2.1.18.
> !jfr_allocations.png!
> Left => 3.0.18
> Right => 2.1.18
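> For reference, an allocation recording like the ones above can be captured
> with plain JFR. Below is a minimal in-process sketch using the jdk.jfr API;
> it assumes a JDK 11+ runtime and an assumed output path, whereas on Java 8
> nodes one would typically start the recording externally via jcmd or Mission
> Control instead:
> {code:java}
> import java.nio.file.Path;
> import java.time.Duration;
> import jdk.jfr.Recording;
>
> public class AllocationProfile {
>     public static void main(String[] args) throws Exception {
>         // 15-minute recording of allocation events with stack traces, which is
>         // what surfaces call trees such as BatchMessage.execute in JMC.
>         Recording recording = new Recording();
>         recording.enable("jdk.ObjectAllocationInNewTLAB").withStackTrace();
>         recording.enable("jdk.ObjectAllocationOutsideTLAB").withStackTrace();
>         recording.setDuration(Duration.ofMinutes(15));
>         recording.setDestination(Path.of("/tmp/allocations.jfr")); // assumed path
>         recording.start();
>         // Keep this JVM alive until the recording completes and is written to
>         // the destination; in practice the recording runs inside the Cassandra
>         // process itself rather than in a standalone main().
>         Thread.sleep(Duration.ofMinutes(15).plusSeconds(10).toMillis());
>     }
> }
> {code}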
> The zipped JFRs exceed the 60MB limit for attaching them directly to the
> ticket. I can upload them if another destination is available.