[https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637652#comment-16637652]

Joseph Lynch edited comment on CASSANDRA-14747 at 10/4/18 12:29 AM:
--------------------------------------------------------------------

[~jasobrown] OK, I think we found the problem! In the new Netty code we 
explicitly set the {{SO_SNDBUF}} [of the outbound 
socket|https://github.com/apache/cassandra/blob/47a10649dadbdea6960836a7c0fe6d271a476204/src/java/org/apache/cassandra/net/async/NettyFactory.java#L332]
 to 64KB. This works great on a low-latency connection, but on long fat networks 
it's a serious issue because a 64KB window is far smaller than the path's 
[bandwidth-delay product|https://en.wikipedia.org/wiki/Bandwidth-delay_product], 
so it significantly restricts your throughput. In the tests we've been running, 
we're trying to push a semi-reasonable amount of traffic (around 8mbps) to peers 
that are about 80ms away (us-east-1 to eu-west-1 is usually about 
[80ms|https://www.cloudping.co/]). With a 64KB window size we simply can't push 
that much, even though the actual link is very high bandwidth. As we can see 
with {{iperf}}, setting a 64KB buffer cripples throughput:
{noformat}
# On the eu-west-1 node X
$ iperf -s -p 8080
------------------------------------------------------------
Server listening on TCP port 8080
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[  4] local X port 8080 connected with Y port 26964
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.5 sec   506 MBytes   404 Mbits/sec
[  5] local X port 8080 connected with Y port 27050
[  5]  0.0-10.5 sec  8.50 MBytes  6.81 Mbits/sec

# On the us-east-1 node Y about 80ms away

$ iperf -N -c X -p 8080
------------------------------------------------------------
Client connecting to X, TCP port 8080
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[  3] local Y port 26964 connected with X port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec   506 MBytes   421 Mbits/sec

$ iperf -N -w 64K -c X -p 8080
------------------------------------------------------------
Client connecting to X, TCP port 8080
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local Y port 27050 connected with X port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  8.50 MBytes  7.03 Mbits/sec
{noformat}
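
That ~7mbps figure is pretty much exactly what the bandwidth-delay product math 
predicts: with the send buffer pinned at 64KB, TCP can only keep 64KB in flight 
per round trip, so at an ~80ms RTT the back-of-the-envelope ceiling is:
{noformat}
65536 bytes / 0.080 s  ≈  800 KiB/s  ≈  6.5 Mbit/s
{noformat}
which lines up with the 6.8-7.0 Mbit/s that {{iperf}} reports above.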

So instead of Cassandra getting the full link's bandwidth of ~500mbps, we're 
only able to get about 7mbps. That is lower than the ~8mbps we need to push, so 
the us-east-1 -> eu-west-1 queues effectively grow without bound until we start 
dropping messages.

I applied a 
[patch|https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d#file-fix-the-problem-diff]
 which does not set {{SO_SNDBUF}} unless explicitly asked to, and *everything is 
completely wonderful* now. Some ways that things are wonderful:

1. The CPU usage is now on par with 3.0.x, and most of that CPU time is spent 
in compaction (both in garbage creation and actual CPU time):

{noformat}
2018-10-03T23:56:40.889+0000 Process summary 
  process cpu=321.33%
  application cpu=301.46% (user=185.93% sys=115.52%)
  other: cpu=19.88% 
  thread count: 274
  GC time=5.27% (young=5.27%, old=0.00%)
  heap allocation rate 478mb/s
  safe point rate: 0.4 (events/s) avg. safe point pause: 135.64ms
  safe point sync time: 0.08% processing time: 5.38% (wallclock time)
[000135] user=49.03% sys=11.84% alloc=  142mb/s - CompactionExecutor:1
[000136] user=44.60% sys=13.81% alloc=  133mb/s - CompactionExecutor:2
[000198] user= 0.00% sys=41.46% alloc=  4833b/s - NonPeriodicTasks:1
[000010] user= 9.56% sys= 0.67% alloc=   57mb/s - spectator-gauge-polling-0
[000029] user= 7.45% sys= 2.13% alloc= 5772kb/s - PerDiskMemtableFlushWriter_0:1
[000036] user= 0.00% sys= 8.98% alloc=  2598b/s - PERIODIC-COMMIT-LOG-SYNCER
[000115] user= 5.74% sys= 2.22% alloc=   12mb/s - MessagingService-NettyInbound-Thread-3-1
[000118] user= 4.03% sys= 3.75% alloc= 2915kb/s - MessagingService-NettyOutbound-Thread-4-3
[000117] user= 3.12% sys= 2.79% alloc= 2110kb/s - MessagingService-NettyOutbound-Thread-4-2
[000144] user= 4.03% sys= 0.92% alloc= 7205kb/s - MutationStage-1
[000146] user= 4.13% sys= 0.77% alloc= 6837kb/s - Native-Transport-Requests-2
[000147] user= 3.12% sys= 1.49% alloc= 6054kb/s - MutationStage-3
[000150] user= 3.22% sys= 1.21% alloc= 6630kb/s - MutationStage-4
[000116] user= 2.72% sys= 1.61% alloc= 1412kb/s - MessagingService-NettyOutbound-Thread-4-1
[000132] user= 2.21% sys= 2.04% alloc=   11mb/s - MessagingService-NettyInbound-Thread-3-2
[000151] user= 2.92% sys= 1.30% alloc= 5462kb/s - Native-Transport-Requests-5
[000134] user= 2.11% sys= 1.71% alloc= 6212kb/s - MessagingService-NettyInbound-Thread-3-4
[000152] user= 3.02% sys= 0.65% alloc= 5357kb/s - MutationStage-6
[000133] user= 1.81% sys= 1.83% alloc= 6236kb/s - MessagingService-NettyInbound-Thread-3-3
[000153] user= 2.32% sys= 1.31% alloc= 4538kb/s - MutationStage-7
{noformat}

The flamegraph is also nice and clean!

2. The latencies for LOCAL_QUORUM are significantly better than 3.0.17. The 
average latency has improved from 2ms to 1.2ms (40% better!!) and the 99th 
percentile has dropped from close to 70ms to 25ms (over 60% better!!). I've 
attached a graph showing the significant latency improvements.

 !trunk_vs_3.0.17_latency_under_load.png! 

3. Cassandra trunk is no longer dropping any messages!
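
(In case it's useful for anyone re-running this: the dropped message counts 
we're watching are the per-message-type "Dropped" table that {{nodetool 
tpstats}} prints at the end of its output.)
{noformat}
$ nodetool tpstats    # dropped counts are in the message-type table at the bottom
{noformat}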



> Evaluate 200 node, compression=none, encryption=none, coalescing=off 
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-14747
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Major
>         Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png, 
> 4.0.11-after-jolynch-tweaks.svg, 4.0.12-after-unconditional-flush.svg, 
> 4.0.15-after-sndbuf-fix.svg, 4.0.7-before-my-changes.svg, 
> 4.0_errors_showing_heap_pressure.txt, 
> 4.0_heap_histogram_showing_many_MessageOuts.txt, 
> i-0ed2acd2dfacab7c1-after-looping-fixes.svg, 
> trunk_vs_3.0.17_latency_under_load.png, 
> ttop_NettyOutbound-Thread_spinning.txt, 
> useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg, 
> useast1e-i-08635fa1631601538_flamegraph_96node.svg, 
> useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes, 
> useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no 
> compression, no encryption, no coalescing).


