[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091331#comment-17091331 ] Benjamin Lerer commented on CASSANDRA-14747: [~vinaykumarcse], [~jolynch] It might make sense for you to wait for CASSANDRA-15700 as it might have an impact on your test results. > Evaluate 200 node, compression=none, encryption=none, coalescing=off > - > > Key: CASSANDRA-14747 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14747 > Project: Cassandra > Issue Type: Sub-task > Components: Legacy/Testing >Reporter: Joey Lynch >Assignee: Joey Lynch >Priority: Normal > Fix For: 4.0-beta > > Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png, > 4.0.11-after-jolynch-tweaks.svg, 4.0.12-after-unconditional-flush.svg, > 4.0.15-after-sndbuf-fix.svg, 4.0.7-before-my-changes.svg, > 4.0_errors_showing_heap_pressure.txt, > 4.0_heap_histogram_showing_many_MessageOuts.txt, > i-0ed2acd2dfacab7c1-after-looping-fixes.svg, > trunk_14503_v2_cpuflamegraph.svg, trunk_vs_3.0.17_latency_under_load.png, > ttop_NettyOutbound-Thread_spinning.txt, > useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg, > useast1e-i-08635fa1631601538_flamegraph_96node.svg, > useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes, > useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg > > > Tracks evaluating a 200 node cluster with all internode settings off (no > compression, no encryption, no coalescing). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090760#comment-17090760 ] Vinay Chella commented on CASSANDRA-14747:

The majority of our tests in this ticket were run on the CASSANDRA-14503 branch, with the goal of evaluating 4.0 (better latency, more throughput, fewer threads, fewer context switches, less GC allocation, and faster recovery time) with all internode settings off (no compression, no encryption, no coalescing). Similar runs and results were recorded in CASSANDRA-15175 with internode settings on (compression and encryption). However, with CASSANDRA-15066 merged, we might have to reevaluate these runs on the latest trunk/alpha-4.
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090689#comment-17090689 ] Benjamin Lerer commented on CASSANDRA-14747:

This ticket has not been updated for almost 5 months, and it is not clear to me what the expected output for it is. [~jolynch] [~vinaykumarcse] what is the status of this ticket?
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677293#comment-16677293 ] Vinay Chella commented on CASSANDRA-14747:

Thank you [~jasobrown] for the patch on CASSANDRA-14503.

[~jolynch] and I benchmarked Jason's 14503-v2 branch. Our benchmark results show that [trunk-Jason's branch|https://github.com/jasobrown/cassandra/tree/14503-v2] significantly outperforms 3.0.17 in mean, 95th, and 99th percentile latencies during a pure write benchmark. When the systems are under heavy load, coordinator mean latencies are ~14x better, 99th percentile latencies are ~4x better, and 95th percentile latencies are ~6x better on trunk. When both trunk and 3.0.17 had 67k write QPS applied, throughput was steady on trunk while 3.0.17 fell over. Note that we have only tested writes in this benchmark.

However, trunk is accumulating more hints and dropping more messages than 3.0.17; these issues have yet to be troubleshot.

For a detailed analysis of this benchmarking, see the attached document [Cassandra 4.0 testing with CASSANDRA-14503 fixes]
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643429#comment-16643429 ] C. Scott Andreas commented on CASSANDRA-14747:

Echoing that - great find and a real nice looking before-and-after; thanks Joey!
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638181#comment-16638181 ] Jason Brown commented on CASSANDRA-14747:

Excellent find, [~jolynch]. Looks like we added the ability to set the send/recv buffer size in CASSANDRA-3378 (which apparently I reviewed, 5.5 years ago). In 3.11 we [set the SO_SNDBUF|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L444] if the operator provided a value in the yaml, but we did not set a default value. However, it does appear I added a hard-coded default in 4.0 with CASSANDRA-8457. As it's been nearly two years since I wrote that part of the patch, I have no recollection of why I added a default. Removing it is trivial and has huge benefits, as has been proven.

I'm working on combining the findings [~jolynch] and I have discovered over the last weeks and should have a patch ready in a few days (which will probably be part of CASSANDRA-14503, as most of this work was based on that work-in-progress).
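The behavior described in the comment above (apply {{SO_SNDBUF}} only when the operator configured a value, otherwise leave the socket alone) can be sketched in a few lines. This is an illustrative Python sketch, not the Cassandra or Netty code; {{send_buffer_options}}, {{apply_options}}, and the {{configured_bytes}} parameter are hypothetical names introduced here. Leaving the option unset matters because, on Linux for example, an explicit {{SO_SNDBUF}} fixes the buffer size and turns off the kernel's send-buffer autotuning for that socket.

```python
import socket
from typing import List, Optional, Tuple

# (level, option name, value) triple as accepted by socket.setsockopt
SocketOption = Tuple[int, int, int]


def send_buffer_options(configured_bytes: Optional[int]) -> List[SocketOption]:
    """Return the send-buffer options to apply to an outbound socket.

    configured_bytes is a hypothetical operator setting; None (or a
    non-positive value) means "not configured".
    """
    if configured_bytes is None or configured_bytes <= 0:
        # No explicit value: do not touch SO_SNDBUF at all, so the
        # operating system default (and any autotuning) stays in effect.
        return []
    return [(socket.SOL_SOCKET, socket.SO_SNDBUF, configured_bytes)]


def apply_options(sock: socket.socket, options: List[SocketOption]) -> None:
    """Apply each (level, name, value) option to the given socket."""
    for level, name, value in options:
        sock.setsockopt(level, name, value)
```

A caller would build the option list once from its configuration and pass it to {{apply_options}} before connecting; with no configured value the list is empty and the socket is left untouched.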
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637835#comment-16637835 ] Dinesh Joshi commented on CASSANDRA-14747:

[~jolynch] this is pretty cool!
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637652#comment-16637652 ] Joseph Lynch commented on CASSANDRA-14747:

[~jasobrown] Ok, I think we found the problem!

In the new Netty code we explicitly set the {{SO_SNDBUF}} [of the outbound socket|https://github.com/apache/cassandra/blob/47a10649dadbdea6960836a7c0fe6d271a476204/src/java/org/apache/cassandra/net/async/NettyFactory.java#L332] to 64KB. This works great if you have a low latency connection, but for long fat networks this is a serious issue, as you restrict your bandwidth significantly due to a high [bandwidth delay product|https://en.wikipedia.org/wiki/Bandwidth-delay_product]. In the tests we've been running, we are trying to push a semi-reasonable amount of traffic (like 8 Mbps) to peers that are about 80ms away (us-east-1 to eu-west-1 is usually about [80ms|https://www.cloudping.co/]). With a 64KB window size we just don't have enough bandwidth, even though the actual link is very high bandwidth.
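The bandwidth-delay arithmetic behind this can be checked with a short calculation. The sketch below is illustrative only; the function names are made up for this example, and it assumes the simplest model in which at most one window of data is in flight per round trip.

```python
def window_limited_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput (megabits/sec) when at most one
    window of data can be in flight per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


def window_needed_bytes(target_mbps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to
    sustain target_mbps over a path with the given round-trip time."""
    return target_mbps * 1_000_000 * rtt_seconds / 8


# 64KB window over an 80ms round trip: about 6.55 Mbps at best
print(window_limited_mbps(64 * 1024, 0.080))

# Window needed to sustain 8 Mbps over the same path: 80,000 bytes,
# which is more than the 65,536 bytes a 64KB buffer allows
print(window_needed_bytes(8, 0.080))
```

Under this model a 64KB window over an 80ms path tops out at roughly 6.5 Mbps, below the 8 Mbps the test needs to push, and sustaining 8 Mbps would take about 80,000 bytes in flight.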
As we can see using {{iperf}}, setting a 64KB buffer cripples throughput:
{noformat}
# On the eu-west-1 node X
$ iperf -s -p 8080
Server listening on TCP port 8080
TCP window size: 12.0 MByte (default)
[ 4] local X port 8080 connected with Y port 26964
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.5 sec 506 MBytes 404 Mbits/sec
[ 5] local X port 8080 connected with Y port 27050
[ 5] 0.0-10.5 sec 8.50 MBytes 6.81 Mbits/sec

# On the us-east-1 node Y about 80ms away
$ iperf -N -c X -p 8080
Client connecting to X, TCP port 8080
TCP window size: 12.0 MByte (default)
[ 3] local Y port 26964 connected with X port 8080
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.1 sec 506 MBytes 421 Mbits/sec

$ iperf -N -w 64K -c X -p 8080
Client connecting to X, TCP port 8080
TCP window size: 128 KByte (WARNING: requested 64.0 KByte)
[ 3] local Y port 27050 connected with X port 8080
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.1 sec 8.50 MBytes 7.03 Mbits/sec
{noformat}
So instead of Cassandra getting the full link's bandwidth of 500mbps we're only able to get 7mbps. This is lower than the 8mbps we need to push so the us-east-1 -> eu-west-1 queues effectively grow without bound until we start dropping messages.

I applied a [patch|https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d#file-fix-the-problem-diff] which does not set {{SO_SNDBUF}} unless explicitly asked to and *everything is completely wonderful* now. Some ways that things are wonderful:

1. The cpu usage is now on par with 3.0.x, and most of that CPU time is spent in compaction (both in garbage creation and actual cpu time):
{noformat}
2018-10-03T23:56:40.889+ Process summary
  process cpu=321.33%
  application cpu=301.46% (user=185.93% sys=115.52%)
  other: cpu=19.88%
  thread count: 274
  GC time=5.27% (young=5.27%, old=0.00%)
  heap allocation rate 478mb/s
  safe point rate: 0.4 (events/s) avg. safe point pause: 135.64ms
  safe point sync time: 0.08% processing time: 5.38% (wallclock time)
[000135] user=49.03% sys=11.84% alloc= 142mb/s - CompactionExecutor:1
[000136] user=44.60% sys=13.81% alloc= 133mb/s - CompactionExecutor:2
[000198] user= 0.00% sys=41.46% alloc= 4833b/s - NonPeriodicTasks:1
[10] user= 9.56% sys= 0.67% alloc= 57mb/s - spectator-gauge-polling-0
[29] user= 7.45% sys= 2.13% alloc= 5772kb/s - PerDiskMemtableFlushWriter_0:1
[36] user= 0.00% sys= 8.98% alloc= 2598b/s - PERIODIC-COMMIT-LOG-SYNCER
[000115] user= 5.74% sys= 2.22% alloc= 12mb/s - MessagingService-NettyInbound-Thread-3-1
[000118] user= 4.03% sys= 3.75% alloc= 2915kb/s - MessagingService-NettyOutbound-Thread-4-3
[000117] user= 3.12% sys= 2.79% alloc= 2110kb/s - MessagingService-NettyOutbound-Thread-4-2
[000144] user= 4.03% sys= 0.92% alloc= 7205kb/s - MutationStage-1
[000146] user= 4.13% sys= 0.77% alloc= 6837kb/s - Native-Transport-Requests-2
[000147] user= 3.12% sys= 1.49% alloc= 6054kb/s - MutationStage-3
[000150] user= 3.22% sys= 1.21% alloc= 6630kb/s - MutationStage-4
[000116] user= 2.72% sys= 1.61% alloc= 1412kb/s - MessagingService-NettyOutbound-Thread-4-1
[000132] user= 2.21% sys= 2.04% alloc= 11mb/s - MessagingService-NettyInbound-Thread-3-2
[000151] user= 2.92% sys= 1.30% alloc= 5462kb/s - Native-Transport-Requests-5
[000134] user= 2.11% sys= 1.71% alloc= 6212kb/s - MessagingService-NettyInbound-Thread-3-4
[000152] user= 3.02% sys= 0.65% alloc= 5357kb/s - MutationStage-6
[000133] user= 1.81% sys= 1.83%
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634851#comment-16634851 ] Joseph Lynch commented on CASSANDRA-14747:

Ah yea, I see that's a problem. I worked around it by making a new callback just for that case. While I was testing it out I also tested flushing unconditionally [https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d#file-some-more-tweaks-diff], and CPU usage dropped by about half and the flamegraph looks _excellent_. I've attached the flamegraph as [^4.0.12-after-unconditional-flush.svg], where we can see that after the unconditional flush we are now spending less than 7% CPU usage (compared to like 70%)! I think that with 198 other nodes we were spending a lot of time waiting with unflushed data in the channel, because there are 195 other queues that get serviced before you get serviced again and fill up the channel.

We're not done yet, as we still have dropped messages (vs 3.0, which has very few if any dropped), but this is much better.
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634301#comment-16634301 ] Jason Brown commented on CASSANDRA-14747:

[~jolynch] Nice work. I agree the time bounding of dequeueMessages is somewhat questionable - I added it in when we were making a bunch of other changes for dealing with CPU/task starvation.

In your gist, I think we can run into some serious overscheduling (re-enqueueing of the consumer task) when the channel is unwritable. In that case, it will break out of dequeueMessages's while loop immediately, but then immediately reschedule (assuming backlog > 0). We'll keep doing this, very aggressively, until the channel becomes writable again - yet we cannot make any meaningful progress. To counteract this, I had dequeueMessages not reschedule, and instead had handleMessageResult reschedule, because at that point (remember, we only attach the listener to the last message of the bunch) we know the bytes have been written to the socket and the channel should be writable again. In this case we only schedule (or directly execute) dequeueMessages when we need to. (Note: this was probably not apparent from the current code's comments, so I should definitely improve that.)
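The overscheduling concern in the comment above can be illustrated with a toy event-loop simulation. This is a simplified Python model, not the {{OutboundMessagingConnection}} code: {{simulate}}, its parameters, and the idea of counting loop "ticks" are assumptions made for this sketch, and the write-completion listener is approximated as firing on the first tick at which the channel is writable again.

```python
from collections import deque
from typing import Tuple


def simulate(backlog: int, writable_at_tick: int, eager: bool) -> Tuple[int, int]:
    """Count how often the dequeue task runs before the backlog drains.

    eager=True  : the task re-enqueues itself whenever backlog remains,
                  even though the channel is still unwritable.
    eager=False : the task stops and relies on a completion listener to
                  schedule it once the channel is writable again.
    Returns (task_runs, messages_sent).
    """
    tasks = deque(["dequeue"])  # the event loop's pending tasks
    runs = sent = tick = 0
    listener_armed = False
    while backlog > 0:
        if tasks:
            tasks.popleft()
            runs += 1
            if tick >= writable_at_tick:
                # Channel is writable: drain the whole backlog.
                sent += backlog
                backlog = 0
            elif eager:
                # No progress possible, but reschedule immediately anyway.
                tasks.append("dequeue")
            else:
                # No progress possible: wait for the completion listener.
                listener_armed = True
        elif listener_armed and tick >= writable_at_tick:
            # Listener fires once earlier bytes hit the socket.
            listener_armed = False
            tasks.append("dequeue")
        tick += 1
    return runs, sent


print(simulate(backlog=5, writable_at_tick=100, eager=True))
print(simulate(backlog=5, writable_at_tick=100, eager=False))
```

In this model the eager variant runs the task on every tick while the channel is blocked, whereas the listener-driven variant runs it once to discover the channel is unwritable and once more when it can make progress.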
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633527#comment-16633527 ] Joseph Lynch commented on CASSANDRA-14747:

With the latest patches on 14503-collab we have made it to the full client load of 49k QPS! The latencies are also better than or about the same as 3.0.17 for LOCAL_QUORUM, although I think that is because the 4.x cluster is still turning a lot of data into hints, and it's still using a lot more CPU than 3.0.17. This is really good progress though.

I've attached flamegraphs of 854789def57dd79399c2a5e45a2c43e3de272136 from 14503-collab, in [^4.0.7-before-my-changes.svg]

With some minor changes [https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d] I was able to reduce CPU usage by about 20%. We still have a ways to go though, as the new flamegraph indicates there is still a lot of optimization available: [^4.0.11-after-jolynch-tweaks.svg]
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624446#comment-16624446 ] Vinay Chella commented on CASSANDRA-14747:

CPU usage is much better after switching from ConcurrentLinkedQueue to jctools' MpscLinkedQueue. Attached are flame graphs ([^useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg] [^useast1e-i-08635fa1631601538_flamegraph_96node.svg]) and ttop ([^useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes]) results from sha 8749df78b29b05d0d3643f1b0c6c2112b79aaca8 of the jasobrown/14503-collab branch.

Interestingly, even with the higher CPU load, the 4.0 cluster has lower latencies for LOCAL_ONE requests than the 3.0.x cluster.

The next focus area is reducing the CPU usage spent on CAS and state transitions.
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619974#comment-16619974 ] Joseph Lynch commented on CASSANDRA-14747:

Things went much better today: after the queue fixes we no longer ran out of memory, but the {{MessagingService-NettyOutbound-Thread}}s were pinned at 100% CPU. We (Jason, Jordan, myself, etc.) tracked it down to various unfortunate looping behaviors in the {{OutboundMessagingConnection}} class. We're following up with various fixes to these queueing problems.

I've attached flame graphs and ttop outputs showing what's going on in the latest version of the {{jasobrown/14503-collab}} branch.
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619566#comment-16619566 ] Joseph Lynch commented on CASSANDRA-14747:

We're re-running the test today with the 14503 branch pulled in and will record results here.
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16615321#comment-16615321 ] Jeff Jirsa commented on CASSANDRA-14747:

Thanks so much for running this!
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612937#comment-16612937 ] Jason Brown commented on CASSANDRA-14747:

[~jolynch] When you have a chance, please take the branch linked on CASSANDRA-14503 and give it a spin. That has the fix for queue bounds.
[jira] [Commented] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612856#comment-16612856 ] Joseph Lynch commented on CASSANDRA-14747:

*Setup:*
* Cassandra: 192 (2*96) i3.xlarge AWS instances (4 cpu cores, 30GB ram) running cassandra trunk f25a765b.
* Two datacenters with 100ms latency between them
* No compression, encryption, or coalescing turned on

*Test:* ndbench sent 30k QPS at the coordinator level to one datacenter (RF=3*2 = 6, so 180k global replica QPS), first of 40kb and then of 4kb single-partition BATCH mutations. This represents about 300 QPS per coordinator in the first datacenter, or 75 per core.

*Result:* We quickly overwhelmed the 4.0 cluster, which started having high latencies and throwing errors while the 3.0 cluster remained healthy. 4.0 nodes were running out of heap within minutes during the 40kb test and within a few minutes during the 4kb test. I've attached flamegraphs showing 4.0 spending half its time garbage collecting and logs indicating large on-heap usage. On the bright side, the thread count was _way down:_ the 3.0 cluster had 1.2k threads and the 4.0 cluster only had 220 threads (and almost all of that reduction was the messaging thread reduction). Also, the startup time was super fast (as in less than one second to handshake the entire cluster, vs 3.0 which took minutes).

We didn't feel that proceeding with the test made sense given the instability until follow-ups could be committed. We used heap dumps and {{jmap}} to determine the issue was the outgoing message queue retaining large numbers of mutations on heap rather than dropping them.

*Follow Ups:* The outgoing queue holding mutations on heap appears to be the problem. Specifically, the 3.x code would police the internode queues and ensure they did not get too large at enqueue and dequeue time (expiring messages and turning them into hints as needed), whereas the 4.0 code took out the enqueue policing complexity in the hope that we wouldn't need it. It appears it is necessary. [~jasobrown] is including fixes to the queue policing in CASSANDRA-14503 and CASSANDRA-13630, and we will re-execute this test once those are merged to ensure that they fix the issue with large volume mutations.
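The kind of queue policing described above (bounding the outgoing queue and expiring stale messages at both enqueue and dequeue time) can be sketched as follows. This is a minimal Python illustration, not the 3.x or 4.0 Cassandra implementation: {{PolicedQueue}}, {{Message}}, the byte-based bound, and the {{on_dropped}} callback (standing in for "turn the mutation into a hint") are all hypothetical names and assumptions made for this example. It also assumes messages share a similar timeout, so the oldest message expires first.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional


@dataclass
class Message:
    payload: bytes
    expires_at: float  # absolute deadline, in seconds


class PolicedQueue:
    """Bounded outbound queue that drops expired messages at enqueue and
    dequeue time instead of retaining them on the heap."""

    def __init__(self, max_bytes: int, on_dropped: Callable[[Message], None]):
        self.max_bytes = max_bytes
        self.on_dropped = on_dropped  # e.g. write a hint for the message
        self.items: Deque[Message] = deque()
        self.size = 0  # total payload bytes currently queued

    def _expire(self, now: float) -> None:
        # Oldest messages sit at the head; drop every one past its deadline.
        while self.items and self.items[0].expires_at <= now:
            msg = self.items.popleft()
            self.size -= len(msg.payload)
            self.on_dropped(msg)

    def enqueue(self, msg: Message, now: float) -> bool:
        self._expire(now)  # police the queue on the way in
        if self.size + len(msg.payload) > self.max_bytes:
            self.on_dropped(msg)  # over the bound: reject rather than grow
            return False
        self.items.append(msg)
        self.size += len(msg.payload)
        return True

    def dequeue(self, now: float) -> Optional[Message]:
        self._expire(now)  # police the queue on the way out
        if not self.items:
            return None
        msg = self.items.popleft()
        self.size -= len(msg.payload)
        return msg
```

Passing {{now}} explicitly keeps the sketch deterministic; a real caller would supply a monotonic clock reading. Policing at enqueue time is what keeps the queue from growing when the consumer side is stalled, since a dequeue-only check never runs if nothing is being dequeued.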