[
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266402#comment-14266402
]
Ariel Weisberg edited comment on CASSANDRA-8457 at 1/6/15 5:05 PM:
-------------------------------------------------------------------
I think I stumbled onto what is going on based on Benedict's suggestion to
disable TCP no delay. It looks like there is a small message performance issue.
This is something I have seen before in EC2 where you can only send a
surprisingly small number of messages in/out of a VM. I don't have the numbers
from when I micro benchmarked it, but it is something like 450k messages with
TCP no delay and a million or so without. Adding more sockets helps, but it
doesn't even double the number of messages in/out. Throwing more cores at the
problem doesn't help either; you just end up with underutilized cores, which
matches the mysterious levels of starvation I was seeing in C* even though I was
exposing sufficient concurrency.
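For reference, toggling this is just the TCP_NODELAY socket option; a minimal, illustrative sketch with a plain java.net.Socket (the address and port are placeholders, and this is not Cassandra's actual connection code):
{code:java}
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class NoDelayExample
{
    public static void main(String[] args) throws Exception
    {
        Socket socket = new Socket();
        // setTcpNoDelay(true) disables Nagle's algorithm so every small write
        // goes out immediately; setTcpNoDelay(false) lets the kernel coalesce
        // small writes into fewer packets, which is the "no delay off" case above.
        socket.setTcpNoDelay(false);
        socket.connect(new InetSocketAddress("192.0.2.10", 7000), 2000); // placeholder peer
        try (OutputStream out = socket.getOutputStream())
        {
            out.write(new byte[] { 1, 2, 3, 4 }); // stand-in for a serialized message
        }
    }
}
{code}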
14 server nodes, 6 client nodes, 500 threads per client. Servers started with
"row_cache_size_in_mb" : "2000",
"key_cache_size_in_mb" : "500",
"rpc_max_threads" : "1024",
"rpc_min_threads" : "16",
"native_transport_max_threads" : "1024"
and an 8-gig old gen, 2-gig new gen.
Clients running at CL=ALL with the same schema I have been using throughout this
ticket.
With TCP no delay off, first set of runs:
390264
387958
392322
After replacing 10 instances:
366579
365818
378221
With TCP no delay on:
162987
Modified trunk to fix a bug in message batching and to add a configurable window
for coalescing multiple messages into a single socket write; see
https://github.com/aweisberg/cassandra/compare/f733996...49c6609
||Coalesce window (microseconds)||Throughput||
|250| 502614|
|200| 496206|
|150| 487195|
|100| 423415|
|50| 326648|
|25| 308175|
|12| 292894|
|6| 268456|
|0| 153688|
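For illustration only, here is a rough sketch of the kind of application-level coalescing the patch does: writer threads enqueue serialized messages, and the flush loop lingers up to a configurable window collecting more messages before doing a single write/flush. The class and field names are made up for the example; see the branch linked above for the real change.
{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative only: not the actual MessagingService/OutboundTcpConnection code.
public class CoalescingWriter implements Runnable
{
    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private final OutputStream socketOut;
    private final long coalesceWindowNanos;

    public CoalescingWriter(OutputStream socketOut, long coalesceWindowMicros)
    {
        this.socketOut = socketOut;
        this.coalesceWindowNanos = TimeUnit.MICROSECONDS.toNanos(coalesceWindowMicros);
    }

    public void enqueue(byte[] serializedMessage)
    {
        queue.add(serializedMessage);
    }

    @Override
    public void run()
    {
        List<byte[]> batch = new ArrayList<>();
        try
        {
            while (!Thread.currentThread().isInterrupted())
            {
                // Block for the first message, then linger up to the window
                // collecting whatever else arrives so it all goes out in one flush.
                batch.add(queue.take());
                long deadline = System.nanoTime() + coalesceWindowNanos;
                long remaining;
                while ((remaining = deadline - System.nanoTime()) > 0)
                {
                    byte[] next = queue.poll(remaining, TimeUnit.NANOSECONDS);
                    if (next == null)
                        break;
                    batch.add(next);
                }
                for (byte[] message : batch)
                    socketOut.write(message);
                socketOut.flush(); // one flush per batch instead of per message
                batch.clear();
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }
}
{code}
A window of 0 degenerates to flushing every message individually, which matches the bottom row of the table.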
I did not expect to get mileage out of coalescing at the application level, but
it works extremely well. CPU utilization is still low at 1800%. There seems to
be little correlation between CPU utilization and throughput: as I vary the
coalescing window, throughput changes dramatically without a matching change in
CPU utilization. I do see that core 0 looks pretty saturated at only 10% idle.
That might be the next bottleneck, or even the actual one.
What role this optimization plays at different cluster sizes is an important
question. There has to be a tipping point where coalescing stops working
because not enough packets go to each endpoint at the same time. With vnodes
it wouldn't be unusual to be communicating with a large number of other hosts,
right?
It also takes a significant amount of additional latency to get the mileage at
high levels of throughput, while at lower concurrency there is no benefit and it
will probably show up as decreased throughput. That makes it tough to crank up
as a default: either it is adaptive, or most people don't get the benefit.
At high levels of throughput it is a clear latency win. Latency is much lower
for individual requests on average. Making this a config option is viable as a
starting point, possibly with a separate option for local vs. remote DC coalescing.
Ideally we could make it adapt to the workload.
I am going to chase down what impact coalescing has at lower levels of
concurrency so we can quantify the cost of turning it on. I'm also going to try
to get to the bottom of why all interrupts are going to core 0. Maybe that is the
real problem and coalescing is just a band-aid to get more throughput.
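To see whether that is actually the case, one quick check is the per-CPU columns in /proc/interrupts; a throwaway sketch (plain JDK, Linux-only, nothing Cassandra-specific) that just sums them per core:
{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sums the per-CPU columns of /proc/interrupts to show how interrupt handling
// is distributed across cores (e.g. whether core 0 is taking nearly all of them).
public class InterruptBalance
{
    public static void main(String[] args) throws Exception
    {
        List<String> lines = Files.readAllLines(Paths.get("/proc/interrupts"));
        // The header row lists the CPUs: "CPU0 CPU1 ..."
        int cpus = lines.get(0).trim().split("\\s+").length;
        long[] totals = new long[cpus];
        for (String line : lines.subList(1, lines.size()))
        {
            String[] cols = line.trim().split("\\s+");
            // cols[0] is the IRQ label; the next columns are per-CPU counts
            for (int cpu = 0; cpu < cpus && cpu + 1 < cols.length; cpu++)
            {
                try
                {
                    totals[cpu] += Long.parseLong(cols[cpu + 1]);
                }
                catch (NumberFormatException ignored)
                {
                    break; // short rows like ERR:/MIS: end early
                }
            }
        }
        for (int cpu = 0; cpu < cpus; cpu++)
            System.out.printf("CPU%d: %d interrupts%n", cpu, totals[cpu]);
    }
}
{code}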
> nio MessagingService
> --------------------
>
> Key: CASSANDRA-8457
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Ariel Weisberg
> Labels: performance
> Fix For: 3.0
>
>
> Thread-per-peer (actually two each, incoming and outbound) is a big
> contributor to context switching, especially for larger clusters. Let's look
> at switching to nio, possibly via Netty.