[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025012#comment-15025012 ]

Joshua McKenzie commented on CASSANDRA-7217:
--------------------------------------------

[~tjake] to review.

> Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
> -------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7217
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7217
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Benedict
>            Assignee: Ariel Weisberg
>              Labels: performance, stress, triaged
>             Fix For: 3.0.1, 3.1
>
>         Attachments: 2000-threads.svg, 500-threads.svg, FakeQuerySystem.java, stub_server.diff
>
>
> This is obviously bad. Let's figure out why it's happening and put a stop to it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025016#comment-15025016 ]

T Jake Luciani commented on CASSANDRA-7217:
-------------------------------------------

LGTM +1
[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007258#comment-15007258 ]

Ariel Weisberg commented on CASSANDRA-7217:
-------------------------------------------

I was able to narrow this down to a configuration issue with the driver, combined with less-than-perfect driver behavior when that configuration is not used. If I increase the maximum number of pending requests per connection from 128 to 256, performance at 1250 threads goes back to normal.

For stress we can do something smarter when setting this tunable so that it reflects the number of available threads. Generally, if we have a thread submitting requests, we want it to have a pending request against the server by default; otherwise all you are really benchmarking is the driver's ability to deal with pending requests.

Then there is a separate driver issue: the degradation in performance when the limit on pending requests is not high enough. I wouldn't expect that kind of drop-off. Whether the request is pending at the client or languishing in a TCP buffer in the server shouldn't really matter.

I haven't looked, but my guess is that when the driver reaches the limit, the thread submitting a request goes to sleep and is then woken up again. This means every request has to flow through extra scheduling points to account for this. A better way is to always flatten the serialized request into a shared buffer; when the connection is ready to accept more work, the network thread can wake up and write multiple requests to the server at once.
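The idea of deriving the pending-request limit from the number of stress threads could be sketched as follows. This is a minimal illustration and not the change made for this ticket: `pending_limit_per_connection`, its parameters, and the connection count used in the usage note are hypothetical. Only the limit of 128, the setting of 256, and the 1250-thread figure come from the comment.

```python
import math

DEFAULT_LIMIT = 128  # per-connection pending-request limit cited in the comment


def pending_limit_per_connection(threads: int, connections: int,
                                 floor: int = DEFAULT_LIMIT) -> int:
    """Smallest per-connection limit that lets every thread keep one request pending.

    Hypothetical helper. Assumes each stress thread has at most one request
    in flight and that requests spread evenly over the connections.
    """
    if threads < 1 or connections < 1:
        raise ValueError("threads and connections must be positive")
    needed = math.ceil(threads / connections)
    # Never go below the existing limit; only raise it when threads demand it.
    return max(floor, needed)
```

With an assumed 8 connections, 1000 threads stay at the floor of 128, while 1250 threads would need 157 per connection, so a limit of 128 would leave some threads waiting in the client. The resulting value would then be passed to whatever pooling option the driver exposes for this limit.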
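The shared-buffer approach suggested at the end of the comment can be illustrated with a small sketch. It is not driver code: `CoalescingWriter`, `submit`, and `drain` are hypothetical names, and the returned bytes stand in for a single write to the socket.

```python
import threading
from collections import deque


class CoalescingWriter:
    """Hypothetical sketch: request threads enqueue serialized requests and
    return at once; one network thread later drains many of them per write."""

    def __init__(self):
        self._pending = deque()
        self._lock = threading.Lock()

    def submit(self, serialized: bytes) -> None:
        # Called by request threads: append and return without waiting
        # for the connection to have capacity.
        with self._lock:
            self._pending.append(serialized)

    def drain(self, max_requests: int) -> bytes:
        # Called by the network thread when the connection can accept more
        # work: take up to max_requests and coalesce them into one buffer.
        with self._lock:
            count = min(max_requests, len(self._pending))
            batch = [self._pending.popleft() for _ in range(count)]
        return b"".join(batch)
```

The design point is that the submitting thread never parks on the per-connection limit; the limit is applied only by the network thread, through the `max_requests` argument, when it decides how much to write.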
[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007275#comment-15007275 ]

Ariel Weisberg commented on CASSANDRA-7217:
-------------------------------------------

Created https://datastax-oss.atlassian.net/browse/JAVA-992 for the suspected Java client driver issue.
[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003017#comment-15003017 ]

Ariel Weisberg commented on CASSANDRA-7217:
-------------------------------------------

Performance counters

2000 threads
{code}
Results:
op rate                   : 19419 [WRITE:19419]
partition rate            : 19419 [WRITE:19419]
row rate                  : 19419 [WRITE:19419]
latency mean              : 103.0 [WRITE:103.0]
latency median            : 91.3 [WRITE:91.3]
latency 95th percentile   : 179.4 [WRITE:179.4]
latency 99th percentile   : 252.3 [WRITE:252.3]
latency 99.9th percentile : 428.5 [WRITE:428.5]
latency max               : 57651.8 [WRITE:57651.8]
Total partitions          : 1900 [WRITE:1900]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:16:18
END

 Performance counter stats for './cassandra-stress write n=1900 -rate threads=2000 -mode native cql3 -node 192.168.1.9':

 3,320,451,421,007 cycles                     #    2.192 GHz                     [15.41%]
 2,563,758,232,484 instructions               #    0.77  insns per cycle
                                              #    0.94  stalled cycles per insn [20.47%]
    69,188,067,241 cache-references           #   45.664 M/sec                   [25.56%]
    13,456,198,724 cache-misses               #   19.449 % of all cache refs     [30.60%]
   131,776,347,830 bus-cycles                 #   86.973 M/sec                   [35.65%]
 2,415,412,133,089 idle-cycles-frontend       #   72.74% frontend cycles idle    [40.69%]
 1,750,197,198,741 idle-cycles-backend        #   52.71% backend cycles idle     [45.75%]
    1514363.238593 cpu-clock (msec)
    1515146.390785 task-clock (msec)          #    1.530 CPUs utilized
           154,815 page-faults                #    0.102 K/sec
        87,357,050 cs                         #    0.058 M/sec
        37,030,093 migrations                 #    0.024 M/sec
           154,691 minor-faults               #    0.102 K/sec
                 0 major-faults               #    0.000 K/sec
                 0 alignment-faults           #    0.000 K/sec
                 0 emulation-faults           #    0.000 K/sec
   358,579,878,595 branch-instructions        #  236.664 M/sec                   [45.74%]
     5,088,330,722 branch-misses              #    1.42% of all branches         [45.80%]
    70,350,080,393 L1-dcache-load-misses      #   46.431 M/sec                   [45.92%]
    24,626,765,787 L1-dcache-store-misses     #   16.254 M/sec                   [40.88%]
    19,812,757,638 L1-dcache-prefetch-misses  #   13.076 M/sec                   [40.97%]
    59,285,911,291 L1-icache-load-misses      #   39.129 M/sec                   [40.92%]
     4,437,071,985 dTLB-load-misses           #    2.928 M/sec                   [40.90%]
       821,151,709 dTLB-store-misses          #    0.542 M/sec                   [40.80%]
     1,188,402,914 iTLB-load-misses           #    0.784 M/sec                   [40.66%]
     5,274,857,779 branch-load-misses         #    3.481 M/sec                   [40.58%]
    39,293,189,238 LLC-loads                  #   25.934 M/sec                   [40.47%]
    10,625,403,856 LLC-stores                 #    7.013 M/sec                   [40.45%]
    16,978,686,645 LLC-prefetches             #   11.206 M/sec                   [10.08%]

     990.019887601 seconds time elapsed
{code}

500 threads
{code}
Results:
op rate                   : 63678 [WRITE:63678]
partition rate            : 63678 [WRITE:63678]
row rate                  : 63678 [WRITE:63678]
latency mean              : 7.8 [WRITE:7.8]
latency median            : 5.6 [WRITE:5.6]
latency 95th percentile   : 16.8 [WRITE:16.8]
latency 99th percentile   : 36.5 [WRITE:36.5]
latency 99.9th percentile : 77.5 [WRITE:77.5]
latency max               : 358.8 [WRITE:358.8]
Total partitions          : 1900 [WRITE:1900]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:04:58
END

 Performance counter stats for './cassandra-stress write n=1900 -rate threads=500 -mode native cql3 -node 192.168.1.9':

 2,055,138,822,781 cycles                     #    2.519 GHz                     [15.25%]
 1,923,953,212,761 instructions               #    0.94  insns per cycle
[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003199#comment-15003199 ]

Ariel Weisberg commented on CASSANDRA-7217:
-------------------------------------------

My takeaway from the counters is that, with 2000 threads, working through 19 million writes took more instructions, almost double the number of cache references, more than double the number of context switches, and double the number of dcache misses. So there was a big drop in efficiency that could explain how this occurs even without contention or starvation.

Whether there is a way to have 2000 threads do this work more efficiently is a good question. There are a lot more performance counters, as well as profiling, that might give insight into what having more threads changed. I'll look into it tomorrow.
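The efficiency drop can be checked against the counters quoted in the earlier comment. The sketch below uses only the figures that survive in this archive (op rate, cycles, and instructions for both runs); the remaining 500-thread counters are cut off here, so the cache and context-switch comparisons are not reproduced. Both runs were given the same number of writes, so the ratio of totals equals the ratio of per-write cost. Variable and function names are illustrative.

```python
# Figures copied from the cassandra-stress and perf output for the two runs.
runs = {
    2000: {"op_rate": 19419,
           "cycles": 3_320_451_421_007,
           "instructions": 2_563_758_232_484},
    500: {"op_rate": 63678,
          "cycles": 2_055_138_822_781,
          "instructions": 1_923_953_212_761},
}


def ipc(run):
    """Instructions retired per cycle for one run."""
    return run["instructions"] / run["cycles"]


instruction_ratio = runs[2000]["instructions"] / runs[500]["instructions"]
cycle_ratio = runs[2000]["cycles"] / runs[500]["cycles"]
throughput_ratio = runs[500]["op_rate"] / runs[2000]["op_rate"]

print(f"instructions, 2000 vs 500 threads: {instruction_ratio:.2f}x")
print(f"cycles, 2000 vs 500 threads:       {cycle_ratio:.2f}x")
print(f"IPC at 2000 / 500 threads:         {ipc(runs[2000]):.2f} / {ipc(runs[500]):.2f}")
print(f"throughput, 500 vs 2000 threads:   {throughput_ratio:.2f}x")
```

The 2000-thread run needs roughly a third more instructions and about 60% more cycles for the same work, which is smaller than the more than threefold difference in op rate. That gap is consistent with the comment's point that extra CPU work alone may not be the whole explanation.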
[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002303#comment-15002303 ]

Ariel Weisberg commented on CASSANDRA-7217:
-------------------------------------------

I was able to reproduce this running the server on my OS X laptop and the client on my quad-core i5 Sandy Bridge Linux desktop. With 500 threads I was getting 80k op/sec and with 2000 I was getting 30k op/sec.

I took flight recordings, but they are too big to look at and not that interesting. There is more contention detected with a 1 millisecond threshold at 500 threads than at 2000 threads, presumably because with 500 threads so much more work is getting done.

CPU utilization at the client is pretty high at 500 threads, above 300%, with 18k interrupts/second and 140k context switches/second. With 2000 threads utilization is lower, closer to 250%, with closer to 10k interrupts/second but 250-300k context switches/second.

My hypothesis is that having so many client threads is a problem for the Netty threads, because there are more client threads than event threads by a large margin. With only one server there would really only be one event thread, since there is a single connection.

In cstar on bdplab I see a sharp drop between 1000 and 1250 threads. I would have expected a graceful slope as the overhead of context switching threads increases, so there is still more to be explained.
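The client-side figures in this comment are easier to compare once normalized per operation. The sketch below is a rough worked calculation under stated assumptions, not a measurement: the function names are illustrative, and the latency estimate assumes a closed loop in which every stress thread keeps exactly one request in flight (Little's law: requests in flight = throughput x latency).

```python
def context_switches_per_op(cs_per_sec: float, ops_per_sec: float) -> float:
    """Client context switches spent per completed operation."""
    return cs_per_sec / ops_per_sec


def implied_mean_latency_ms(threads: int, ops_per_sec: float) -> float:
    """Mean latency implied by Little's law, assuming one in-flight op per thread."""
    return threads * 1000.0 / ops_per_sec


# 500 threads: 80k op/sec, 140k context switches/sec
print(context_switches_per_op(140_000, 80_000))
print(implied_mean_latency_ms(500, 80_000))

# 2000 threads: 30k op/sec, 250-300k context switches/sec
print(context_switches_per_op(250_000, 30_000))
print(context_switches_per_op(300_000, 30_000))
print(implied_mean_latency_ms(2000, 30_000))
```

On these numbers the client spends about 1.75 context switches per operation at 500 threads and roughly 8 to 10 at 2000 threads, while the implied mean latency rises from about 6 ms to about 67 ms.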
[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200604#comment-14200604 ]

Shawn Kumar commented on CASSANDRA-7217:
----------------------------------------

I'll be continuing testing on a more CPU-performant instance, but thought I would briefly try cstar_perf on bdplab. [Here|http://cstar.datastax.com/graph?stats=dd73c4a6-65d9-11e4-9413-bc764e04482c&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=279.07&ymin=0&ymax=120665.6] are the results. I increased the threads from 500 to 1500 in 250-thread increments from the first operation to the last, and there is a noticeable drop.

>                 Key: CASSANDRA-7217
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7217
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benedict
>            Assignee: Shawn Kumar
>              Labels: performance, triaged
>             Fix For: 2.1.2
[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996595#comment-13996595 ]

Jason Brown commented on CASSANDRA-7217:
----------------------------------------

Do you think this is a problem on the stress side, or on the server side? Do you see a problem with thrift? Lastly, should I assume this arose due to testing your changes on #4718?

>                 Key: CASSANDRA-7217
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7217
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benedict
>              Labels: performance
>             Fix For: 2.1.0
[jira] [Commented] (CASSANDRA-7217) Native transport performance (with cassandra-stress) drops precipitously past around 1000 threads
[ https://issues.apache.org/jira/browse/CASSANDRA-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997409#comment-13997409 ]

Benedict commented on CASSANDRA-7217:
-------------------------------------

I haven't investigated much at all, so I don't know the answer to any of these questions yet (except that it did indeed come about off the back of CASSANDRA-4718). The only thing I can say for sure is that it is unrelated to MaxRPC (i.e. nothing to do with native transport threads blocking on adding to the work queue).

>                 Key: CASSANDRA-7217
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7217
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benedict
>              Labels: performance
>             Fix For: 2.1.0