[ https://issues.apache.org/jira/browse/CASSANDRA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791913#comment-13791913 ]

Benedict commented on CASSANDRA-1632:
-------------------------------------

Regrettably, after playing around extensively with Jason's patches (and a few
tweaks of my own), I simply don't see any performance improvement using FJP, or
FJP with ThreadAffinity. In fact, I see a consistent decline of ~5% in most of
my performance tests when using ThreadAffinity, and a pretty much neutral
response to using FJP. For those I told that I saw an improvement: that was a
mistake; I was comparing apples to oranges.
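
For context, the kind of change being tested is roughly a drop-in swap of a
stage's executor for an FJP; the following is only a minimal sketch with
illustrative names, not Jason's actual patch:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.ForkJoinPool;

    public class FjpStageExecutors
    {
        /**
         * Minimal sketch of a ForkJoinPool standing in for a stage's executor.
         * asyncMode=true gives FIFO ordering for externally submitted tasks,
         * which is closer to how a stage queue behaves than the default LIFO
         * work-stealing mode.
         */
        public static ExecutorService forkJoinStage(int threads)
        {
            return new ForkJoinPool(threads,
                                    ForkJoinPool.defaultForkJoinWorkerThreadFactory,
                                    null,  // default uncaught exception handling
                                    true); // asyncMode: FIFO for submitted tasks
        }
    }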

I should note I'm focusing exclusively on in-memory reads at the moment; these
should be the most readily helped by any threading improvements, as they should
be the lowest-cost reads.

The one thing that has consistently improved performance is to perform local 
reads synchronously (i.e. not despatch them to the READ stage, but perform them 
directly in the requesting thread). This has shown a consistent 15%+ speed 
bump. The downside is that the benefit is only going to be 15% * 1/N across the
cluster, and without some major precautions it could have negative consequences
for reads that aren't served from memory (by bypassing the concurrency limit
for reads), so it doesn't seem like a promising avenue right now. The same
trick can't be applied to remote reads, as they simply go
IN(Single Thread)->READ(MT)->OUT(ST).
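
To make that concrete, the short circuit is roughly the following; a sketch
only, with illustrative names (submitRead, readStage) rather than the real
code path:

    import java.net.InetAddress;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;
    import java.util.concurrent.FutureTask;

    public class LocalReadShortCircuit
    {
        /**
         * Sketch of performing local reads synchronously: if the replica we
         * would read from is this node, run the read on the requesting thread
         * instead of dispatching it to the READ stage.
         */
        public static <T> Future<T> submitRead(Callable<T> readTask,
                                               InetAddress replica,
                                               InetAddress localAddress,
                                               ExecutorService readStage)
        {
            FutureTask<T> future = new FutureTask<T>(readTask);
            if (replica.equals(localAddress))
            {
                // Local replica: execute inline, skipping the READ stage queue.
                // This is exactly what bypasses the read concurrency limit,
                // hence the caveat above about reads not served from memory.
                future.run();
            }
            else
            {
                // Otherwise hand off to the READ stage as usual.
                readStage.execute(future);
            }
            return future;
        }
    }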

One thing I am considering looking into is separating the 
compression/decompression of the connections from the I/O; a reasonable 
percentage of CPU time is spent here, and for workloads involving only a small 
number of nodes this could be a bottleneck as the connection is starved of data 
it could be writing. By disabling compression I see a 20-25% speed bump for 
"remote" reads (local, forced over the network), which is no doubt an unfair 
test over loopback, but worth exploring as each connection is currently limited 
to what one CPU can compress in real-time. One obvious downside of this is that 
sensibly batching messages together for compression is difficult.
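
A rough sketch of what I mean by separating compression from the connection's
I/O thread, using the JDK Deflater purely for illustration (all names here are
hypothetical, and it glosses over frame ordering and the batching problem just
mentioned, which a real version would have to solve):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    public class CompressionOffloadSketch
    {
        // Several threads compress outgoing frames; a single writer thread
        // drains the queue and writes to the socket, so the connection is no
        // longer limited to what one core can compress in real time.
        private final ExecutorService compressors = Executors.newFixedThreadPool(4);
        private final BlockingQueue<byte[]> compressed = new ArrayBlockingQueue<byte[]>(1024);

        public void enqueue(final byte[] frame)
        {
            compressors.execute(new Runnable()
            {
                public void run()
                {
                    try
                    {
                        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                        DeflaterOutputStream out =
                            new DeflaterOutputStream(buffer, new Deflater(Deflater.BEST_SPEED));
                        out.write(frame);
                        out.close(); // finishes the deflate stream
                        compressed.put(buffer.toByteArray());
                    }
                    catch (IOException e)
                    {
                        throw new RuntimeException(e);
                    }
                    catch (InterruptedException e)
                    {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // The connection's writer thread would loop on this and write each
        // compressed buffer to the socket.
        public byte[] nextCompressedFrame() throws InterruptedException
        {
            return compressed.take();
        }
    }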


> Thread workflow and cpu affinity
> --------------------------------
>
>                 Key: CASSANDRA-1632
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1632
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chris Goffinet
>            Assignee: Jason Brown
>              Labels: performance
>
> Here are some thoughts I wanted to write down; we need to run some serious 
> benchmarks to see the benefits:
> 1) All thread pools for our stages use a shared queue per stage. For some 
> stages we could move to a model where each thread has its own queue. This 
> would reduce lock contention on the shared queue. This workload only suits 
> stages that have no variance, otherwise you run into thread starvation. One 
> stage where this might work: ROW-MUTATION.
> 2) Set cpu affinity for each thread in each stage. If we can pin threads to 
> specific cores, and control the workflow of a message from Thrift down to 
> each stage, we should see improvements from reducing L1 cache misses. We 
> would need to build a JNI extension (to set cpu affinity), as I could not 
> find anywhere in the JDK where it was exposed.
> 3) Batching the delivery of requests across stage boundaries. Peter Schuller 
> hasn't looked deeply enough into the JDK yet, but he thinks there may be 
> significant improvements to be had there, especially in high-throughput 
> situations: on each consumption, consume everything already in the queue 
> rather than imposing a synchronization point between each request.
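
For reference, point 3 in the description above amounts to draining the stage
queue in one go rather than dequeueing one task at a time; a minimal sketch of
that pattern (a hypothetical consumer, not existing code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BatchDrainConsumer implements Runnable
    {
        private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>();

        public void submit(Runnable task)
        {
            queue.add(task);
        }

        public void run()
        {
            List<Runnable> batch = new ArrayList<Runnable>();
            while (!Thread.currentThread().isInterrupted())
            {
                try
                {
                    // Block for the first task, then drain everything already
                    // queued in one call, instead of paying the queue's
                    // synchronization cost once per request.
                    batch.add(queue.take());
                    queue.drainTo(batch);
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                    break;
                }
                for (Runnable task : batch)
                    task.run();
                batch.clear();
            }
        }
    }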


