[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266402#comment-14266402 ]

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/6/15 5:05 PM:
-------------------------------------------------------------------

I think I stumbled onto what is going on based on Benedict's suggestion to 
disable TCP no delay. It looks like there is a small-message performance issue. 

This is something I have seen before in EC2, where you can only send a 
surprisingly small number of messages in/out of a VM. I don't have the numbers 
from when I microbenchmarked it, but it is something like 450k messages with 
TCP no delay and a million or so without. Adding more sockets helps, but it 
doesn't even double the number of messages in/out. Throwing more cores at the 
problem doesn't help either; you just end up with underutilized cores, which 
matches the mysterious levels of starvation I was seeing in C* even though I 
was exposing sufficient concurrency.
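
For context, "no delay" here is just the standard TCP_NODELAY socket option. A 
minimal Java sketch of the toggle being discussed (not the actual C* code path; 
the peer address is a placeholder):

{code:java}
import java.io.IOException;
import java.net.Socket;

public class NoDelayExample
{
    public static void main(String[] args) throws IOException
    {
        try (Socket socket = new Socket("10.0.0.1", 7000)) // placeholder peer
        {
            // true disables Nagle: every small write goes out immediately as its
            // own packet, which is where the per-message ceiling in EC2 bites.
            // false ("no delay off" in the runs below) lets the kernel coalesce
            // small writes into fewer packets at the cost of some latency.
            socket.setTcpNoDelay(false);
            socket.getOutputStream().write(new byte[64]); // one small message
            socket.getOutputStream().flush();
        }
    }
}
{code}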

14 server nodes. 6 client nodes. 500 threads per client. Servers started with:
        "row_cache_size_in_mb" : "2000",
        "key_cache_size_in_mb":"500",
        "rpc_max_threads" : "1024",
        "rpc_min_threads" : "16",
        "native_transport_max_threads" : "1024"
8-gig old gen, 2-gig new gen.

Client running CL=ALL and the same schema I have been using throughout this 
ticket.

With TCP no delay off
First set of runs:
390264
387958
392322
After replacing 10 instances:
366579
365818
378221

With TCP no delay on
162987

Modified trunk to fix a bug in batching messages and to add a configurable 
window for coalescing multiple messages into a socket; see 
https://github.com/aweisberg/cassandra/compare/f733996...49c6609. A rough 
sketch of the coalescing idea follows the results table below.
||Coalesce window (microseconds)||Throughput||
|250| 502614|
|200| 496206|
|150| 487195|
|100| 423415|
|50| 326648|
|25| 308175|
|12| 292894|
|6| 268456|
|0| 153688|
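
Roughly the idea behind the coalescing patch (a hand-wavy sketch under my own 
naming, not the code in the branch linked above): instead of flushing the 
socket after every message, block for the first message, then keep draining 
the queue for up to the coalescing window so several small messages share one 
write and, ideally, one packet.

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names and structure are made up.
final class CoalescingWriter
{
    private final BlockingQueue<byte[]> queue;   // messages destined for one peer
    private final OutputStream socketOut;        // assumed to be buffered
    private final long coalesceWindowMicros;     // 0..250 as in the table above

    CoalescingWriter(BlockingQueue<byte[]> queue, OutputStream socketOut, long coalesceWindowMicros)
    {
        this.queue = queue;
        this.socketOut = socketOut;
        this.coalesceWindowMicros = coalesceWindowMicros;
    }

    void writeLoop() throws InterruptedException, IOException
    {
        List<byte[]> batch = new ArrayList<>();
        while (true)
        {
            batch.add(queue.take()); // block until at least one message is pending
            long deadline = System.nanoTime() + TimeUnit.MICROSECONDS.toNanos(coalesceWindowMicros);
            long remaining;
            while ((remaining = deadline - System.nanoTime()) > 0)
            {
                byte[] next = queue.poll(remaining, TimeUnit.NANOSECONDS);
                if (next == null)
                    break;          // window expired with nothing else queued
                batch.add(next);    // arrived inside the window, coalesce it
            }
            for (byte[] message : batch)
                socketOut.write(message);
            socketOut.flush();      // one flush for the whole batch
            batch.clear();
        }
    }
}
{code}

In this sketch a window of 0 degenerates to flush-per-message.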

I did not expect to get mileage out of coalescing at the application level, but 
it works extremely well. CPU utilization is still low at 1800%. CPU utilization 
and throughput also seem less correlated than I would expect: as I vary the 
coalescing window, throughput changes dramatically. I do see that core 0 is 
looking pretty saturated and is only 10% idle. That might be the next 
bottleneck, or the actual one.

What role this optimization plays at different cluster sizes is an important 
question. There has to be a tipping point where coalescing stops working 
because not enough packets go to each endpoint at the same time. With vnodes 
it wouldn't be unusual to be communicating with a large number of other hosts, 
right?

It also takes a significant amount of additional latency to get this mileage at 
high levels of throughput, while at lower concurrency there is no benefit and it 
will probably show up as decreased throughput. That makes it tough to crank up 
as a default: either it is adaptive, or most people don't get the benefit.

At high levels of throughput it is a clear latency win; latency is much lower 
for individual requests on average. Making this a config option is viable as a 
starting point, possibly with a separate option for local vs. remote DC 
coalescing. Ideally we could make it adapt to the workload.
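
One possible shape for "adapt to the workload" (purely hypothetical; the class, 
thresholds, and constants are made up for illustration): track how many 
messages each flush actually coalesced and widen or shrink the window 
accordingly, so low-concurrency workloads pay no added latency.

{code:java}
// Hypothetical sketch, not a proposal for the actual implementation.
final class AdaptiveCoalesceWindow
{
    private static final long MAX_WINDOW_MICROS = 250; // best-throughput setting from the table
    private long windowMicros = 0;                     // start with coalescing off

    long currentWindowMicros()
    {
        return windowMicros;
    }

    // Called after each socket flush with the number of messages it coalesced.
    void onFlush(int messagesInBatch)
    {
        if (messagesInBatch >= 4)
            // Coalescing is paying off; allow more latency, up to the cap.
            windowMicros = Math.min(MAX_WINDOW_MICROS, Math.max(1, windowMicros * 2));
        else if (messagesInBatch <= 1)
            // Low concurrency; back off toward zero so we stop adding latency.
            windowMicros /= 2;
    }
}
{code}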

I am going to chase down what impact coalescing has at lower levels of 
concurrency so we can quantify the cost of turning it on. I'm also going to try 
to get to the bottom of why all interrupts go to core 0. Maybe that is the real 
problem and coalescing is just a band-aid to get more throughput.



> nio MessagingService
> --------------------
>
>                 Key: CASSANDRA-8457
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Ariel Weisberg
>              Labels: performance
>             Fix For: 3.0
>
>
> Thread-per-peer (actually two each incoming and outbound) is a big 
> contributor to context switching, especially for larger clusters.  Let's look 
> at switching to nio, possibly via Netty.


