[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-14 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Fix Version/s: (was: 4.0.x)
   (was: 3.11.x)
Since Version: 3.0 alpha 1

> Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations 
> compared to 2.1.18
> 
>
> Key: CASSANDRA-15430
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15430
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Thomas Steinmaurer
>Priority: Normal
> Fix For: 3.0.x
>
> Attachments: dashboard.png, jfr_allocations.png, jfr_jmc_2-1.png, 
> jfr_jmc_2-1_obj.png, jfr_jmc_3-0.png, jfr_jmc_3-0_obj.png, 
> jfr_jmc_3-0_obj_obj_alloc.png, jfr_jmc_3-11.png, jfr_jmc_3-11_obj.png, 
> jfr_jmc_4-0-b2.png, jfr_jmc_4-0-b2_obj.png, mutation_stage.png, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png
>
>
> In a 6-node load-test cluster, we have been running a production-like 
> workload constantly on 2.1.18. After upgrading one node to 3.0.18 (the 
> remaining 5 stayed on 2.1.18 once we saw the regression described below), 
> the 3.0.18 node shows increased CPU usage, increased GC, high mutation 
> stage pending tasks, dropped mutation messages, etc.
> Some specs; all 6 nodes are equally sized:
>  * Bare metal, 32 physical cores, 512G RAM
>  * Xmx31G, G1, max pause millis = 2000ms (roughly the flags sketched below)
>  * cassandra.yaml basically unchanged, thus the same settings with regard 
> to number of threads, compaction throttling, etc.
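> As a rough sketch, those JVM settings correspond to flags along these 
> lines (assumed jvm.options/cassandra-env.sh entries; the exact lines were 
> not posted):
> {noformat}
> -Xmx31G
> -XX:+UseG1GC
> -XX:MaxGCPauseMillis=2000
> {noformat}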
> The following dashboard shows highlighted areas (CPU, suspension) with 
> metrics for all 6 nodes; the one outlier is the node upgraded to Cassandra 
> 3.0.18.
>  !dashboard.png|width=1280!
> Additionally, we see a large increase in pending tasks in the mutation 
> stage after the upgrade:
>  !mutation_stage.png!
> And dropped mutation messages, also confirmed in the Cassandra log:
> {noformat}
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage                   256     81824     3360532756         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ViewMutationStage                 0         0              0         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage                         0         0       62862266         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - RequestResponseStage              0         0     2176659856         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - ReadRepairStage                   0         0              0         0                 0
> INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - CounterMutationStage              0         0              0         0                 0
> ...
> {noformat}
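> The pool and dropped-message counters can also be cross-checked live with 
> nodetool (standard command; output layout varies by version):
> {noformat}
> nodetool tpstats
> {noformat}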
> Judging from 15-minute JFR sessions for both 3.0.18 and 2.1.18 (taken on a 
> different node), at a high level it looks like the code path underneath 
> {{BatchMessage.execute}} is producing ~10x more on-heap allocations in 
> 3.0.18 than in 2.1.18.
>  !jfr_allocations.png!
> Left => 3.0.18
>  Right => 2.1.18
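> For anyone reproducing this, a 15-minute recording like the ones above can 
> be captured with jcmd (on the Oracle JDK 8 builds these versions typically 
> run on, JFR additionally needs {{-XX:+UnlockCommercialFeatures 
> -XX:+FlightRecorder}}; the PID and file path are placeholders):
> {noformat}
> jcmd <cassandra-pid> JFR.start duration=15m filename=/tmp/cassandra.jfr
> {noformat}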
> The zipped JFRs exceed the 60 MB limit for attaching directly to the 
> ticket. I can upload them if there is another destination available.
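> For context, {{BatchMessage.execute}} is the server-side handler for 
> native-protocol BATCH requests, i.e. CQL batches of the kind sketched 
> below (a generic illustration with made-up keyspace/table names, not our 
> actual workload):
> {noformat}
> BEGIN UNLOGGED BATCH
>   INSERT INTO ks.tbl (pk, ck, val) VALUES (?, ?, ?);
>   INSERT INTO ks.tbl (pk, ck, val) VALUES (?, ?, ?);
> APPLY BATCH;
> {noformat}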






[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-14 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Resolution: Duplicate  (was: Fixed)
Status: Resolved  (was: Open)




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-14 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Status: Open  (was: Resolved)




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-13 Thread Marcus Eriksson (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcus Eriksson updated CASSANDRA-15430:

Resolution: Fixed
Status: Resolved  (was: Open)

Closing as a duplicate of CASSANDRA-16201 - aiming to fix all these issues there




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
 Bug Category: Parent values: Degradation(12984), Level 1 values: Resource Management(12995)
   Complexity: Normal
  Component/s: Local/Other
Discovered By: User Report
Fix Version/s: 4.0.x
   3.11.x
   3.0.x
 Severity: Normal
   Status: Open  (was: Triage Needed)




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Attachment: jfr_jmc_3-0_obj_obj_alloc.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Attachment: jfr_jmc_3-11_obj.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Attachment: jfr_jmc_4-0-b2.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Attachment: jfr_jmc_4-0-b2_obj.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Attachment: jfr_jmc_3-11.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Attachment: jfr_jmc_3-0.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Attachment: jfr_jmc_2-1_obj.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Attachment: jfr_jmc_2-1.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-10-12 Thread Michael Semb Wever (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-15430:
---
Attachment: jfr_jmc_3-0_obj.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-01-17 Thread Thomas Steinmaurer (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Steinmaurer updated CASSANDRA-15430:
---
Attachment: screenshot-4.png




[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-01-17 Thread Thomas Steinmaurer (Jira)


 [ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Steinmaurer updated CASSANDRA-15430:
---
Attachment: screenshot-3.png







[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-01-17 Thread Thomas Steinmaurer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Steinmaurer updated CASSANDRA-15430:
---
Attachment: screenshot-2.png







[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2020-01-17 Thread Thomas Steinmaurer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Steinmaurer updated CASSANDRA-15430:
---
Attachment: screenshot-1.png







[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2019-11-15 Thread Thomas Steinmaurer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Steinmaurer updated CASSANDRA-15430:
---
Description: 


[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18

2019-11-15 Thread Thomas Steinmaurer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Steinmaurer updated CASSANDRA-15430:
---
Description: 