[
https://issues.apache.org/jira/browse/CASSANDRA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828301#comment-13828301
]
Jason Brown commented on CASSANDRA-1632:
----------------------------------------
OK, so after two months of trying to get thread affinity to rock the world, I
have to admit that I can't get better performance than what we currently get
with the kernel scheduler. Details of code/testing follow, but my hunch as to why I
didn't see a boost, and in most cases saw degradation, comes down to:
- we use many, many threads (over 1000 on some of Netflix's production servers,
currently running c* 1.1), more than the count of processors. Most
studies/literature I found about thread affinity used a thread count less than
or equal to the cpu count, so not much help there.
- a given request will be passed across several threads/pools, and in order to
get the best cpu cache coherency for that data (which, I believe, would be the
best goal for this work), its path through the code base should stay on the same
core (for L1/L2 cache hits) or at least on the same socket (hopefully hitting
the shared L3 cache). I briefly thought about working this in, but the heavy use
of statics and singletons makes this a tricky and daunting task, as I would need
to either 'kill
the damn singletons' (thx, @driftx :)) or carry around some complicated thread
state in thread local-like structures.
First off, Peter Lawrey's thread affinity library
(https://github.com/peter-lawrey/Java-Thread-Affinity) was reasonably easy to
use, and I was able to pin threads to processors without trouble. I then created
several cpu affinity strategies:
- EqualSpreadCpuAffinityStrategy - Intended to be used with an
AbstractExecutorService (ThreadPoolExecutor or ForkJoinPool), this
implementation assigns each successively created thread within a pool to the
next sequential CPU, thus attempting to spread the threads around equally
amongst the CPUs. As we have several AbstractExecutorService instances inside
of cassandra, this implementation picks a random CPU to start with, which
avoids overloading the lower-numbered CPUs.
- FirstAssignmentCpuAffinityStrategy - pins a thread to the first cpu the
kernel assigned it to.
- NoCpuAffinityStrategy - a no-op, used for comparing against the other
strategies and, mainly, against trunk.
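As a rough illustration of the first strategy, here is a minimal sketch of the CPU-selection logic; the class name and shape are hypothetical, not the code from the branch, and the actual pinning would go through Java-Thread-Affinity rather than this selector:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the equal-spread assignment: each pool starts at a
// random CPU (so multiple pools don't all pile onto cpu0) and hands out
// successive CPUs round-robin to each newly created thread.
public class EqualSpreadCpuSelector
{
    private final int cpuCount;
    private final AtomicInteger next;

    public EqualSpreadCpuSelector(int cpuCount)
    {
        this.cpuCount = cpuCount;
        // random starting CPU, one per pool instance
        this.next = new AtomicInteger(ThreadLocalRandom.current().nextInt(cpuCount));
    }

    /** CPU id for the next thread created in this pool. */
    public int nextCpu()
    {
        // floorMod keeps the result in [0, cpuCount) even if the counter wraps
        return Math.floorMod(next.getAndIncrement(), cpuCount);
    }
}
```

Eight successive calls on an 8-cpu box visit all eight CPUs exactly once, just with a per-pool random offset.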
I applied these strategies (one at a time) to the various ThreadPoolExecutors
we have (via overloading the NamedThreadFactory). After many variations, I
ended up just applying the thread affinity to several key places, including
OTC, ITC, READ & MUTATE stages, as well as the native protocol's
RequestThreadPoolExecutor and TDisruptorServer (you can see my hacked up
version of the disruptor-thrift lib at
https://github.com/jasobrown/disruptor_thrift_server/tree/thread_affinity).
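The wiring could look roughly like the hedged sketch below. AffinityThreadFactory is hypothetical (the branch overloads NamedThreadFactory instead), and the injected pinner stands in for the actual Java-Thread-Affinity call:

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;
import java.util.function.IntSupplier;

// Hypothetical ThreadFactory that pins each new thread before it runs any work.
// cpuSelector implements the strategy (equal-spread, first-assignment, ...);
// pinner performs the actual pin, e.g. via Java-Thread-Affinity.
public class AffinityThreadFactory implements ThreadFactory
{
    private final String namePrefix;
    private final IntSupplier cpuSelector;
    private final IntConsumer pinner;
    private final AtomicInteger threadId = new AtomicInteger();

    public AffinityThreadFactory(String namePrefix, IntSupplier cpuSelector, IntConsumer pinner)
    {
        this.namePrefix = namePrefix;
        this.cpuSelector = cpuSelector;
        this.pinner = pinner;
    }

    @Override
    public Thread newThread(Runnable task)
    {
        int cpu = cpuSelector.getAsInt();
        Runnable wrapped = () -> {
            // pin first, so the thread never migrates mid-task
            pinner.accept(cpu);
            task.run();
        };
        return new Thread(wrapped, namePrefix + '-' + threadId.incrementAndGet());
    }
}
```

Handing such a factory to a ThreadPoolExecutor is enough to apply a strategy to one stage without touching the stage's own code.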
One thing I discovered was that by isolating cassandra from the CPUs that
handle IRQs (esp. disk and network IO), I did get a modest bump in throughput
(~5%). As it depends on the kernel and the OS's configuration whether the
disk/network IRQs are pinned to one (or more) specific CPUs, this is a little
difficult to abstract out. Anecdotally, the ec2 instances that I used for
testing always assigned cpu0 for blkio (disk) and cpu1 for network (eth0). Spot
checking other Netflix instances of different instance types and even different
kernel versions showed that the IRQ distribution was not consistent across our
nodes (very similar, but not the same). Thus, while we could create a spin-off
ticket to have cassandra isolate itself from those cpus, I think that work
should be explored outside this ticket.
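For reference, finding which CPU services which IRQ amounts to parsing /proc/interrupts. The sketch below is illustrative only (hypothetical class, simplified sample format, not output captured from the ec2 nodes above); it picks the CPU with the highest interrupt count per device:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: given /proc/interrupts-style text, map each device
// (e.g. "eth0") to the CPU that has serviced most of its interrupts.
public class IrqCpuMap
{
    public static Map<String, Integer> busiestCpuPerDevice(String procInterrupts)
    {
        Map<String, Integer> result = new LinkedHashMap<>();
        String[] lines = procInterrupts.trim().split("\n");
        // header row ("CPU0 CPU1 ...") tells us how many per-CPU count columns follow
        int cpuCount = lines[0].trim().split("\\s+").length;
        for (int i = 1; i < lines.length; i++)
        {
            String[] cols = lines[i].trim().split("\\s+");
            // only numbered IRQ rows ("24:", "27:") name a device in the last column
            if (!cols[0].matches("\\d+:") || cols.length < cpuCount + 2)
                continue;
            String device = cols[cols.length - 1];
            long max = -1;
            int busiest = 0;
            for (int cpu = 0; cpu < cpuCount; cpu++)
            {
                long count = Long.parseLong(cols[1 + cpu]);
                if (count > max)
                {
                    max = count;
                    busiest = cpu;
                }
            }
            result.put(device, busiest);
        }
        return result;
    }
}
```

With that mapping in hand, a pool could simply exclude those CPUs from its selection strategy.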
The remnants of all my coding on thread affinity can be found here:
https://github.com/jasobrown/cassandra/tree/1632_threadAffinity
Testing:
env: ec2, three nodes in us-west-1
m2.4xlarge, 8 processors, 68G RAM
Linux 3.2.0.52 (Ubuntu 12.10 LTS)
cassandra - 8GB heap, 800MB new gen
traffic-generating application: @belliottsmith's improved cassandra-stress
(https://github.com/belliottsmith/cassandra/tree/iss-6199-stress, 47a96c1f5557f)
Compared with trunk (during early-mid Nov 2013), I found the performance of
EqualSpreadCpuAffinityStrategy to be about 10-20% worse (throughput and
latency). The FirstAssignmentCpuAffinityStrategy was more or less on par with
trunk.
I found that while thread affinity reduced the CPU migrations of threads (as
measured by 'perf stat -p $pid') by an order of magnitude, there was no
appreciable efficiency gain to cassandra as a whole. As I don't have access to
the PMU on ec2 instances (for obtaining metrics like L1/2/3 cache hit ratio
from perf), I could not measure if the thread affinity code actually made the
entire process more efficient. Either way, the latency and throughput
performance metrics obtained at the client (cassandra stress) did not bear out
an overall improvement in cassandra.
If all we can do, at best, is match the kernel's scheduler, I feel comfortable
with shelving the thread affinity work for now, unless anybody finds something
I've missed or misunderstood.
> Thread workflow and cpu affinity
> --------------------------------
>
> Key: CASSANDRA-1632
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1632
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Chris Goffinet
> Assignee: Jason Brown
> Labels: performance
> Attachments: threadAff_reads.txt, threadAff_writes.txt
>
>
> Here are some thoughts I wanted to write down, we need to run some serious
> benchmarks to see the benefits:
> 1) All thread pools for our stages use a shared queue per stage. For some
> stages we could move to a model where each thread has its own queue. This
> would reduce lock contention on the shared queue. This workload only suits
> the stages that have no variance, else you run into thread starvation. Some
> stages that this might work: ROW-MUTATION.
> 2) Set cpu affinity for each thread in each stage. If we can pin threads to
> specific cores, and control the workflow of a message from Thrift down to
> each stage, we should see improvements on reducing L1 cache misses. We would
> need to build a JNI extension (to set cpu affinity), as I could not find
> anywhere in JDK where it was exposed.
> 3) Batching the delivery of requests across stage boundaries. Peter Schuller
> hasn't looked deep enough yet into the JDK, but he thinks there may be
> significant improvements to be had there. Especially in high-throughput
> situations. If on each consumption you were to consume everything in the
> queue, rather than implying a synchronization point in between each request.
--
This message was sent by Atlassian JIRA
(v6.1#6144)