[
https://issues.apache.org/jira/browse/CASSANDRA-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299303#comment-17299303
]
Tom Whitmore commented on CASSANDRA-16499:
------------------------------------------
Hi [~benedict], [~clohfink],
I was able to do some testing on Linux and got some positive results: a
*significant* (+30.9%) throughput improvement in the single-thread case, with
smaller gains (+2.0% to +9.6%) over the 10, 50 and 200 thread cases.
Tests were with an m5d.2xlarge EC2 instance with 300 GB local SSD, running
Cassandra 4 beta 4 on Amazon Linux and Java 8.
Results from Cassandra Stress:
||Threads||Baseline Op/s||Baseline Latency mean||Baseline Latency p99.9||Patched Op/s||Patched Latency mean||Patched Latency p99.9||Op/s Difference||
|1|4979|0.2|0.3|6520|0.1|0.2|+30.9%|
|10|33220|0.3|1.35|33882|0.3|1.4|+2.0%|
|50|49686|1|24.3|54473|0.9|16.5|+9.6%|
|200|65646|3|94.3|67303|2.9|91.8|+2.5%|
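For reference, the Op/s Difference column is simply (patched − baseline) / baseline, which a trivial check reproduces (this snippet is mine, not part of the test harness):

```java
// Recompute the Op/s Difference column of the table above:
// difference = (patched - baseline) / baseline, as a percentage.
class OpsDiff
{
    static double pctDiff(double baseline, double patched)
    {
        return (patched - baseline) / baseline * 100.0;
    }

    public static void main(String[] args)
    {
        // {baseline Op/s, patched Op/s} for the 1, 10, 50 and 200 thread rows
        double[][] rows = { {4979, 6520}, {33220, 33882}, {49686, 54473}, {65646, 67303} };
        for (double[] r : rows)
            System.out.printf("+%.1f%%%n", pctDiff(r[0], r[1]));
    }
}
```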
Benefits drop off at larger numbers of cassandra-stress threads; however,
latencies still seem comparable or slightly better.
See spreadsheet & detailed test results:
* [^MaybeStartSpinning Unpark fix; Linux benchmarks -- 07.xlsx]
* [^AMI Linux test -- 09.txt]
My intuition was that unparking threads on enqueuing a task might logically
apply across platforms, not just Windows.
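As a toy illustration of the idea (this is not the actual SEPExecutor/SEPWorker code; the class and field names are invented), the pattern amounts to a worker that naps via LockSupport.parkNanos when its queue is empty, and a producer that unparks it on every enqueue instead of letting the nap run out:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Minimal sketch of "unpark the worker on enqueue": a parked worker is
// woken immediately by the producer rather than waiting out its park
// timeout. Not Cassandra code; an illustration of the mechanism only.
class UnparkOnEnqueue
{
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean parked = new AtomicBoolean(false);
    private volatile boolean shutdown = false;
    private final Thread worker = new Thread(this::runLoop);

    void start()
    {
        worker.start();
    }

    void enqueue(Runnable task)
    {
        tasks.add(task);
        // The fix in spirit: if the worker went to sleep, wake it now.
        if (parked.compareAndSet(true, false))
            LockSupport.unpark(worker);
    }

    private void runLoop()
    {
        while (!shutdown)
        {
            Runnable task = tasks.poll();
            if (task != null)
            {
                task.run();
            }
            else
            {
                parked.set(true);
                if (tasks.peek() == null)          // re-check to avoid a lost wake-up
                    LockSupport.parkNanos(20_000); // ~20us nap, matching the observed park times
                parked.set(false);
            }
        }
    }

    void stop()
    {
        shutdown = true;
        LockSupport.unpark(worker);
    }
}
```

The compareAndSet guard means an already-running worker costs the producer nothing; only a genuinely parked worker pays for an unpark, and LockSupport's permit semantics make an unpark that races ahead of the park harmless.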
We now have the following test results:
* +30.9% single-thread throughput on Linux
* improved latency on Linux under single-thread/ low load conditions (-50%
mean, -33% p99.9; though measurements aren't very accurate)
* small to marginal improvements on Linux for more threads (10, 50, 200)
* a basic "smell test" of load-test latencies (p99, p99.9) etc on Linux up to
200 cassandra-stress threads suggests it's plausibly stable
* 18x faster single-thread throughput on Windows
* 5x faster 10-thread throughput on Windows
* the Windows improvements seem big enough to potentially make a previously
unviable platform viable, performance-wise.
Given this evidence, could it now make sense to consider this as a serious
enhancement candidate to incorporate into SEPExecutor for all platforms?
> single-threaded write workloads can spend ~70% time waiting on SEPExecutor
> --------------------------------------------------------------------------
>
> Key: CASSANDRA-16499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16499
> Project: Cassandra
> Issue Type: Bug
> Reporter: Tom Whitmore
> Priority: Normal
> Attachments: AMI Linux test -- 09.txt, Cassandra Write trace 5;
> warmed up -- 02.txt, MaybeStartSpinning Unpark fix on 4beta4; Cassandra
> Stress results -- 01.txt, MaybeStartSpinning Unpark fix; Cassandra Stress
> results -- 02.txt, MaybeStartSpinning Unpark fix; Linux benchmarks --
> 07.xlsx, SEPWorker trace 2 delay examples -- 01.txt, SEPWorker trace 2
> delays.txt, SEPWorker trace 3 delays; with proposed fix.txt, Single-thread
> Latencies report -- 01.xlsx, Stress Write 2 sgl-thread vs 10 threads --
> 01.txt, Stress Write sgl-thread 1 -- 01.txt, Stress Write trace 1.txt,
> proposed fix patches.zip, tracing & experimental change patches.zip
>
>
> Hi all! While conducting benchmarking of Cassandra against other databases
> for a particular healthcare solution, I found some surprising anomalies in
> single-threaded write performance.
> Analysis & tracing suggest there might be some inefficiencies in inter-thread
> execution in Cassandra:
> * Tracing showed an average delay of 1.52 ms between
> StorageProxy.performLocally() being called, and the LocalMutationRunnable
> actually executing.
> * Total operation time averaged 2.06 ms (measured at Message.Dispatcher
> processRequest()). This suggested that ~72% of the +total operation time+ was
> lost waiting for thread scheduling in SEPExecutor.
> * When I tested with multiple threads, performance with 10 threads was 27x
> higher. This supports a hypothesis that scheduling delays may be hindering
> single-threaded progress.
> * Cassandra's transaction throughput with a single-threaded workload measured
> far lower than PostgreSQL's on the same hardware: Postgres achieved ~200k
> committed transactions/minute including fsync, while Cassandra achieved ~37k
> per minute. Given that both are essentially writing to a commit log, it may be
> informative to understand where the difference arises.
> Cassandra's architecture seems in theory like it might be aligned with our
> use case, given the Commit Log and Log Structured Merge design. Some of our
> customers have configurations posing high single-threaded loads. Therefore I
> spent some time trying to understand why efficiency for such loads seemed
> less than expected.
> My investigation so far:
> * benchmarked Cassandra 3.11.10
> * stack-dumped it under load & identified a pattern of threads waiting in
> AbstractWriteResponseHandler while nothing else is busy
> * checked out Cassandra 3.11.10 source, built it, debugged & stepped
> through key areas to try and understand behavior.
> * instrumented key areas with custom tracing code & timestamps to 0.01
> millisecond.
> ** _see patch attached._
> * benchmarked Cassandra 4 beta 4 & verified delays also present
> * shown & traced delays with my healthcare scenario benchmark
> * shown & traced delays with the +Cassandra stress-test+ tool.
> The configuration was:
> * single-node Cassandra running locally, on a recent Dell laptop with NVMe
> SSD.
> * for the healthcare scenario:
> ** Java client app running 1 or 10 threads;
> ** trialled LOCAL_ONE and ANY consistency levels;
> ** trialled unbatched, BatchType.UNLOGGED and BatchType.LOGGED.
> * for 'cassandra-stress':
> ** cassandra-stress.bat write n=10000 -rate threads=1
> Without deeply understanding the code, I have considered a couple of possible
> areas/ ideas for improvement, working on the 3.11.10 codebase. I'd be
> interested to understand whether these might be sound; note that neither
> achieves as much improvement as might theoretically be hoped for.
> My investigations are based on the key observation of large delays between
> StorageProxy.performLocally() being invoked and the LocalMutationRunnable
> actually executing, for single-threaded workloads.
> What I looked at:
> * Without fully understanding SEPExecutor.takeWorkPermit() (it answers true
> to execute immediately, false if the task is to be scheduled instead),
> scheduling seemed slow for this workload.
> ** takeWorkPermit() answers false if no work permits are available.
> ** I noticed takeWorkPermit() also answers false if no task permits are
> available, +even if no task permit need be taken.+
> ** by changing this condition I was able to gain +45% performance.
> * Without deeply understanding SEP Executor/ Worker or sleep algorithms, I
> noted that in a single-thread workload SEPWorkers would likely spin & be put
> to sleep for a period after completing each task.
> ** I did wonder if the park -times- or parking behavior (empirically
> observed at 10,000 - 20,000 nanos) could contribute to threads being more
> aggressively de-scheduled.
> ** an experiment in keeping 1 SEPWorker awake (not sleeping at all) gained
> +7.9% performance.
> ** _Note: initial ticket misread code as requesting 500,000 nanosecond
> sleeps. This has now been corrected._
> * Without deeply understanding SEP Executor/ Worker, I feel there may be
> more questions around how SEP Workers are brought out of SPINNING/ sleep
> state and whether this logic functions promptly & correctly.
> ** At a very initial stage of investigation: +SEPWorker.assign() unparks
> threads when transitioning out of STOPPED state, but code appears potentially
> not to unpark threads coming out of SPINNING state.+
> ** _This is a very cursory "looking at the code" & initial debugging stage,
> but I'm not certain it's accurate. Attempted experiments to unpark for the
> SPINNING -> Work transition have so far caused lockups of 100% machine CPU
> use or dropped messages, rather than helping anything._
> ** _If & when I can find out more, I'll post it here._
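To make the takeWorkPermit() observation concrete, here is a hedged sketch (not the real SEPExecutor code; the bit-packing of permits into one long is assumed for illustration) in which the task-permit check only gates the call when a task permit is actually being taken:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of work/task permits packed into one AtomicLong.
// Encoding (assumed): low 32 bits = work permits, high 32 bits = task permits.
class Permits
{
    private final AtomicLong permits;

    Permits(int workPermits, int taskPermits)
    {
        permits = new AtomicLong(((long) taskPermits << 32) | workPermits);
    }

    static int workPermits(long p) { return (int) (p & 0xFFFFFFFFL); }
    static int taskPermits(long p) { return (int) (p >>> 32); }

    // The behaviour questioned above: the original code answered false when
    // task permits were exhausted even if takeTaskPermit == false. Here the
    // task-permit check applies only when a task permit is actually taken.
    boolean takeWorkPermit(boolean takeTaskPermit)
    {
        while (true)
        {
            long p = permits.get();
            if (workPermits(p) == 0)
                return false;
            if (takeTaskPermit && taskPermits(p) == 0)
                return false;
            long next = p - 1;                      // take a work permit
            if (takeTaskPermit)
                next -= (1L << 32);                 // also take a task permit
            if (permits.compareAndSet(p, next))
                return true;
        }
    }
}
```

Under this sketch a caller that does not need a task permit can still run immediately while task permits are exhausted, which is the direction of the +45% change described above.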
> I will post the tracing code & traces I captured, and welcome some feedback
> and thoughts on these performance questions from the Cassandra dev community.
> Thanks all!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)