[
https://issues.apache.org/jira/browse/CASSANDRA-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298449#comment-17298449
]
Tom Whitmore commented on CASSANDRA-16499:
------------------------------------------
Hi [~benedict], [~clohfink],
I think I've found something extremely interesting and would like some
confirmation.
My observations:
* A detailed trace of task-scheduling behavior (in particular, whether a
spinning worker was already available or not) and the resulting scheduling
delays suggested that delays might be greater when a spinning worker was
already available and an extra one was therefore not started.
* Spinning workers often enter a 'parked' state. I had already noted a lack of
unpark() calls to bring worker threads out of the 'SPINNING' state; unpark()
appears to be used, at least in 3.11.10, only to bring threads out of the
'STOPPED' state.
* I noted that SEPExecutor.addTask() and
SharedExecutorPool.maybeStartSpinningWorker() did not wake a sleeping worker,
if one was already in SPINNING state.
I therefore wondered whether changing
SharedExecutorPool.maybeStartSpinningWorker() to unpark the first available
SPINNING worker might help tasks to be scheduled more promptly.
I have made such a change on the 3.11.10 code and (running locally on Windows)
obtained some very surprising performance results.
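To illustrate the essence of the idea, here is a small standalone sketch (hypothetical illustrative code using java.util.concurrent.locks.LockSupport, not the actual SEPWorker/SharedExecutorPool implementation): a worker that parks itself in a SPINNING-like state is woken explicitly by the task producer, instead of being left to wait out its park timeout.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.LockSupport;

public class SpinningWorkerSketch {
    static final ConcurrentLinkedQueue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    static volatile Thread spinningWorker; // worker currently parked in a SPINNING-like state

    // Submits one task and returns the delay (ms) until the worker ran it.
    static long scheduleOneTask() throws InterruptedException {
        CountDownLatch ran = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            while (true) {
                Runnable task = tasks.poll();
                if (task != null) { task.run(); return; }
                // No work: record ourselves as spinning and park with a long timeout.
                spinningWorker = Thread.currentThread();
                LockSupport.parkNanos(500_000_000L); // 500 ms; a large latency if never unparked
                spinningWorker = null;
            }
        });
        worker.start();
        Thread.sleep(50); // let the worker reach the parked state

        long start = System.nanoTime();
        tasks.add(ran::countDown);
        // The essence of the fix: wake the already-spinning worker explicitly,
        // rather than leaving it parked until its timeout expires.
        Thread w = spinningWorker;
        if (w != null) LockSupport.unpark(w);

        ran.await();
        worker.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("scheduling delay ~" + scheduleOneTask() + " ms");
    }
}
```

With the explicit unpark, the submitted task runs almost immediately; if the unpark call is removed, the worker only runs the task once its park timeout expires, which is the same kind of scheduling delay my traces showed.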
h3. Results Obtained
Testing using Cassandra Stress-Test with single-threaded & 10-thread workloads:
||Version/ Fix||1 thread op/s||1 thread latency mean||10 thread op/s||10 thread
latency mean||
|3.11.10 baseline|362 op/s|2.6 ms|5,696 op/s|1.7 ms|
|with MaybeStartSpinning Unpark fix|6,809 op/s|0.1 ms|24,639 op/s|0.4 ms|
| |_18.8x faster_| |_4.3x faster_| |
I am also seeing major improvements in my healthcare benchmark (12x throughput
single-threaded, 5x throughput with 10 threads).
Analysis of delays using my tracing code also showed a significant reduction
in scheduling delays.
I did also try an additional change to unpark spinning workers in
SEPWorker.assign() when others assigned work to a spinning state, but this
showed a slight decrease in throughput so I have not included it.
Please see patch & results attached:
* [^proposed fix patches.zip] on 3.11.10 codebase
* [^MaybeStartSpinning Unpark fix; Cassandra Stress results -- 02.txt]
Could you please try replicating these results using Cassandra Stress? Testing
requires running just two 'cassandra-stress' tests against a single-node local
cluster. (Note that I had to increase the operation counts, since the
single-threaded test was otherwise completing in <2 seconds.)
* cassandra-stress.bat write n=50000 -rate threads=1
* cassandra-stress.bat write n=500000 -rate threads=10
I believe that even given a 0.5 ms datacenter network hop in typical
deployments, this reduction in latency could be very significant for
performance. I note these results do seem surprisingly large, so I would be
very interested to have you replicate them (baseline vs. with patch) and to see
results on other environments.
Let me know!
Regards,
Tom
> single-threaded write workloads can spend ~70% time waiting on SEPExecutor
> --------------------------------------------------------------------------
>
> Key: CASSANDRA-16499
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16499
> Project: Cassandra
> Issue Type: Bug
> Reporter: Tom Whitmore
> Priority: Normal
> Attachments: Cassandra Write trace 5; warmed up -- 02.txt,
> MaybeStartSpinning Unpark fix; Cassandra Stress results -- 02.txt, SEPWorker
> trace 2 delay examples -- 01.txt, SEPWorker trace 2 delays.txt, SEPWorker
> trace 3 delays; with proposed fix.txt, Single-thread Latencies report --
> 01.xlsx, Stress Write 2 sgl-thread vs 10 threads -- 01.txt, Stress Write
> sgl-thread 1 -- 01.txt, Stress Write trace 1.txt, proposed fix patches.zip,
> tracing & experimental change patches.zip
>
>
> Hi all! While conducting benchmarking of Cassandra against other databases
> for a particular healthcare solution, I found some surprising anomalies in
> single-threaded write performance.
> Analysis & tracing suggest there might be some inefficiencies in inter-thread
> execution in Cassandra:
> * Tracing showed an average delay of 1.52 ms between
> StorageProxy.performLocally() being called, and the LocalMutationRunnable
> actually executing.
> * Total operation time averaged 2.06 ms (measured at Message.Dispatcher
> processRequest()). This suggested ~72% of the +total operation time+ being
> lost waiting for thread scheduling in SEPExecutor.
> * When I tested with multiple threads, performance with 10 threads was 27x
> higher. This supports a hypothesis that scheduling delays may be hindering
> single-threaded progress.
> * Transaction throughput for Cassandra with a single-threaded workload
> measured far lower than that of PostgreSQL on the same hardware. Postgres
> achieved ~200k committed transactions/ minute including fsync; Cassandra
> achieved ~37k per minute. Given that both are essentially writing to a commit
> log, it may be informative to understand why the differences arise.
> Cassandra's architecture seems, in theory, like it might be well aligned with
> our use case, given the Commit Log and Log Structured Merge design. Some of
> our customers have configurations posing high single-threaded loads. I
> therefore spent some time trying to understand why efficiency for such loads
> seemed lower than expected.
> My investigation so far:
> * benchmarked Cassandra 3.11.10
> * stack-dumped it under load & identified a pattern of threads waiting in
> AbstractWriteResponseHandler while nothing else is busy
> * checked out Cassandra 3.11.10 source, built it, debugged & stepped
> through key areas to try and understand behavior.
> * instrumented key areas with custom tracing code & timestamps to 0.01
> millisecond.
> ** _see patch attached._
> * benchmarked Cassandra 4 beta 4 & verified delays also present
> * shown & traced delays with my healthcare scenario benchmark
> * shown & traced delays with the +Cassandra stress-test+ tool.
> The configuration was:
> * single-node Cassandra running locally, on a recent Dell laptop with an NVMe
> SSD.
> * for the healthcare scenario:
> ** Java client app running 1 or 10 threads;
> ** trialled LOCAL_ONE and ANY consistency levels;
> ** trialled unbatched, BatchType.UNLOGGED and BatchType.LOGGED.
> * for 'cassandra-stress':
> ** cassandra-stress.bat write n=10000 -rate threads=1
> Without deeply understanding the code, I have considered a couple of possible
> areas/ ideas for improvement. I worked on the 3.11.10 codebase. I'd be
> interested to understand whether or not these might be sound; note that
> neither achieves as much improvement as might theoretically be hoped for.
> My investigations are based on the key observation of large delays between
> StorageProxy.performLocally() being invoked and the LocalMutationRunnable
> actually executing, for single-threaded workloads.
> What I looked at:
> * Without fully understanding SEPExecutor.takeWorkPermit() – it answers true
> to execute immediately, false if the task is queued for later – scheduling
> seemed slow for this workload.
> ** takeWorkPermit() answers false if no work permits are available.
> ** I noticed takeWorkPermit() also answers false if no task permits are
> available, +even if no task permit need be taken.+
> ** by changing this condition I was able to gain +45% performance.
> * Without deeply understanding SEP Executor/ Worker or sleep algorithms, I
> noted that in a single-thread workload SEPWorkers would likely spin & be put
> to sleep for a period after completing each task.
> ** I did wonder if the park -times- or parking behavior (empirically
> observed at 10,000 - 20,000 nanos) could contribute to threads being more
> aggressively de-scheduled.
> ** an experiment in keeping 1 SEPWorker awake (not sleeping at all) gained
> +7.9% performance.
> ** _Note: initial ticket misread code as requesting 500,000 nanosecond
> sleeps. This has now been corrected._
> * Without deeply understanding SEP Executor/ Worker, I feel there may be
> more questions around how SEP Workers are brought out of SPINNING/ sleep
> state and whether this logic functions promptly & correctly.
> ** At a very initial stage of investigation: +SEPWorker.assign() unparks
> threads when transitioning out of STOPPED state, but code appears potentially
> not to unpark threads coming out of SPINNING state.+
> ** _This is a very cursory "looking at the code" & initial debugging stage,
> but I'm not certain it's accurate._ _Attempted experiments to unpark for the
> SPINNING -> Work transition so far_ _have_ _caused lockups of 100% machine
> CPU use or dropped messages, rather than helping anything._
> ** _If & when I can find out more, I'll post it here._
> I will post the tracing code & traces I captured, and welcome some feedback
> and thoughts on these performance questions from the Cassandra dev community.
> Thanks all!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)