[ 
https://issues.apache.org/jira/browse/CASSANDRA-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302187#comment-17302187
 ] 

Tom Whitmore commented on CASSANDRA-16499:
------------------------------------------

"The precise number of threads is not remotely comparable between test rigs" – 
I'd agree that's likely, even before comparing across years.

Thanks [~dcapwell] for the graphs. One question that seems fairly important to 
me – how many threads were the workloads tested with?
 * I've done a quick analysis in Excel (attached): [^analysis of David 
Capwell's latency stats -- 01.xlsx]
 * It shows improvements to the mean in all scenarios, ranging from 2% to 15%.
 * Minimum latency is improved in most scenarios.
 * Latency at the very tail appears potentially _worsened_ – as David says, 
visually this appears to occur above p99.6 – with the maximum increasing by 8% 
to 15%, and by 50% for the 'Large blobs, 20% reads' scenario.
 * Despite latencies being slightly worse at the far tail, significant latency 
improvements across the broad mid-range more than make up for it, yielding 
fairly significant overall improvements to the mean.
 * I'm guessing here – but I assume these results were taken with large thread 
counts. The main downside case I see for waking the spinning thread is when 
another thread was already assigned and ready to work; waking an additional 
spinner then may just result in an unnecessary wake-up.

"It is a core behaviour of the executor in question, specifically that the 
worker threads self-organise with limited scheduler interaction between 
producers and consumers.". Interesting:
 * This does raise the question of how single-threaded workloads – where 
there needs to be tight sequencing of work onto threads – were intended to be 
handled, and how the performance of such workloads was benchmarked.
 * I can understand 'limited coordination' as a potentially desirable 
principle, but for this kind of workload it would seem to imply either 
inefficiency from spinning wait loops or delays in scheduling.
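To make that concrete, here is a schematic of the spin-then-park pattern at 
issue – an illustration only, not the actual SEPWorker code. tryAssignWork() 
is a hypothetical stand-in, and the park intervals simply echo the 10,000 - 
20,000 nanos observed in the traces attached to this ticket:
{code:java}
import java.util.concurrent.locks.LockSupport;

// Schematic: a worker that finds no work parks in short intervals until
// work arrives. A single-threaded producer submitting a task either catches
// the brief spin window or waits out a park plus an OS wake-up.
abstract class IdleWorkerSketch
{
    abstract boolean tryAssignWork();  // hypothetical: worker claims a task

    void idleLoop()
    {
        long parkNanos = 10_000;       // order of magnitude from my traces
        while (!tryAssignWork())
        {
            LockSupport.parkNanos(parkNanos);  // may be de-scheduled here
            parkNanos = Math.min(parkNanos * 2, 20_000);
        }
    }
}
{code}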

I'm not sure whether it would be elegant, but it might potentially be possible 
to combine the "limited interaction" regime of SEPExecutor behavior at higher 
thread counts with a "producer signalled" interaction at lower thread counts.
 * Producer signalling (eg, waking a spinning thread) could be conditioned on 
thread numbers – so as to act when few threads are spinning or assigned, but 
become inactive under heavier workloads.
 * We do have a few possible numbers to work with – eg. the total # spinning, 
and per-executor the # of work and task permits.
 * I don't have a great insight into what metrics would give best thread 
execution efficiency – I'll just throw a possibility out there as a starting 
point.
 ** If we have Task Permits > 0 and no Worker Permits, we want to wake a 
spinner.
 ** If we have Worker Permits >= Task Permits, we don't need to wake a spinner.
 ** If we have Worker Permits >= Task Permits - 1, there is presumably some 
number of Workers (worker permits) beyond which waiting for a worker to become 
free or self-organize is more efficient than waking one up.
 ** So one possible condition to wake a spinner would be along the lines of: 
if (TaskPermits > (WorkerPermits + (WorkerPermits SHR <constant>))). With the 
question being: what is the constant? _(a minimal sketch follows this list)_
 * _I'm interested in any thoughts, feedback or alternatives to this._
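To make the heuristic concrete, here is a minimal sketch of the wake 
condition. The class, method and SPIN_WAKE_SHIFT names are hypothetical, for 
illustration – this is not the SEPExecutor API:
{code:java}
// Sketch of the proposed wake-a-spinner condition. The parameters mirror
// the permit counts discussed above; SPIN_WAKE_SHIFT is the open tuning
// constant ("what is the constant?").
final class WakeHeuristicSketch
{
    static final int SPIN_WAKE_SHIFT = 2;  // placeholder value

    static boolean shouldWakeSpinner(int taskPermits, int workerPermits)
    {
        if (taskPermits == 0)
            return false;  // nothing queued: nothing to wake for
        if (workerPermits == 0)
            return true;   // work queued, no workers free: wake a spinner
        // In between, wake only when queued work outstrips free workers by
        // a margin that grows with the worker count:
        return taskPermits > workerPermits + (workerPermits >> SPIN_WAKE_SHIFT);
    }
}
{code}
Under light load this signals promptly; as worker counts grow the condition 
goes quiet and the executor falls back to pure self-organisation, which is the 
intent of combining the two regimes.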

"By modifying the behaviour as proposed, logically there is no advantage to 
this executor's signalling approach over a plain {{ThreadPoolExecutor}}, and 
likely additional penalties".
 * Interesting analysis, thanks Benedict. This gives me a better understanding 
of why you are interested in comparison with the plain TPE.
 * Viewing the problem in this light, all options are likely to involve at 
least some degree of tradeoff. It seems plausible the SEP may well retain 
benefits over the plain TPE for some workload regimes. However, it would seem 
desirable to improve single-threaded performance one way or another, as this 
currently seems the most problematic area.

I'd be interested to know whether either the Cassandra project or DataStax has 
a standard performance/regression testing suite for Cassandra; it would 
definitely seem worthwhile to ensure that single-threaded workloads are 
sufficiently represented in such a suite.

Regards,
Tom

> single-threaded write workloads can spend ~70% time waiting on SEPExecutor
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16499
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16499
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tom Whitmore
>            Priority: Normal
>              Labels: performance
>         Attachments: AMI Linux test -- 09.txt, Cassandra Write trace 5;  
> warmed up -- 02.txt, MaybeStartSpinning Unpark fix on 4beta4; Cassandra 
> Stress results -- 01.txt, MaybeStartSpinning Unpark fix; Cassandra Stress 
> results -- 02.txt, MaybeStartSpinning Unpark fix; Linux benchmarks -- 
> 07.xlsx, SEPWorker trace 2 delay examples -- 01.txt, SEPWorker trace 2 
> delays.txt, SEPWorker trace 3 delays;  with proposed fix.txt, Screen Shot 
> 2021-03-15 at 10.55.05 AM.png, Screen Shot 2021-03-15 at 10.55.14 AM.png, 
> Screen Shot 2021-03-15 at 10.56.02 AM.png, Screen Shot 2021-03-15 at 10.56.08 
> AM.png, Screen Shot 2021-03-15 at 10.56.14 AM.png, Screen Shot 2021-03-15 at 
> 10.56.25 AM.png, Screen Shot 2021-03-15 at 10.57.22 AM.png, Screen Shot 
> 2021-03-15 at 10.57.31 AM.png, Screen Shot 2021-03-15 at 10.57.40 AM.png, 
> Screen Shot 2021-03-15 at 10.57.47 AM.png, Single-thread Latencies report -- 
> 01.xlsx, Stress Write 2 sgl-thread vs 10 threads -- 01.txt, Stress Write 
> sgl-thread 1 -- 01.txt, Stress Write trace 1.txt, analysis of David Capwell's 
> latency stats -- 01.xlsx, proposed fix patches.zip, tracing & experimental 
> change patches.zip
>
>
> Hi all! While conducting benchmarking of Cassandra against other databases 
> for a particular healthcare solution, I found some surprising anomalies in 
> single-threaded write performance. 
> Analysis & tracing suggest there might be some inefficiencies in inter-thread 
> execution in Cassandra:
>  * Tracing showed an average delay of 1.52 ms between 
> StorageProxy.performLocally() being called, and the LocalMutationRunnable 
> actually executing.
>  * Total operation time averaged 2.06 ms (measured at Message.Dispatcher 
> processRequest()). This suggests ~72% of the +total operation time+ is lost 
> waiting for thread scheduling in SEPExecutor.
>  * When I tested with multiple threads, performance with 10 threads was 27x 
> higher. This supports the hypothesis that scheduling delays are hindering 
> single-threaded progress.
>  * Transaction throughput for Cassandra with a single-threaded workload 
> measured far lower than that of PostgreSQL on the same hardware: Postgres 
> achieved ~200k committed transactions/minute including fsync, while Cassandra 
> achieved ~37k per minute – a roughly 5.4x gap. Given they are both essentially 
> writing to a commit log, it may be informative to understand where the 
> difference arises.
> Cassandra's architecture seems in theory like it might be aligned for our 
> usecase, given the Commit Log and Log Structured Merge design. Some of our 
> customers have configurations posing high single-threaded loads. Therefore I 
> spent some time trying to understand why efficiency for such loads seemed 
> less than expected.
> My investigation so far:
>  * benchmarked Cassandra 3.11.10
>  * stack-dumped it under load & identified a pattern of threads waiting in 
> AbstractWriteResponseHandler while nothing else is busy
>  * checked out Cassandra 3.11.10 source, built it, debugged & stepped 
> through key areas to try and understand behavior.
>  * instrumented key areas with custom tracing code & timestamps to 0.01 
> millisecond.
>  ** _see patch attached._
>  * benchmarked Cassandra 4 beta 4 & verified delays also present
>  * shown & traced delays with my healthcare scenario benchmark
>  * shown & traced delays with the +Cassandra stress-test+ tool.
> The configuration was:
>  * single-node Cassandra running locally, on a recent Dell laptop with NVmE 
> SSD.
>  * for the healthcare scenario:
>  ** Java client app running 1 or 10 threads;
>  ** trialled LOCAL_ONE and ANY consistency levels;
>  ** trialled unbatched, BatchType.UNLOGGED and BatchType.LOGGED.
>  * for 'cassandra-stress':
>  ** cassandra-stress.bat write n=10000 -rate threads=1
> Without deeply understanding the code, I have considered a couple of possible 
> areas/ideas for improvement. I worked on the 3.11.10 codebase. I'd be 
> interested to understand whether these might be sound; note that neither 
> achieves as much improvement as might theoretically be hoped for.
> My investigations are based on the key observation of large delays between 
> StorageProxy.performLocally() being invoked and the LocalMutationRunnable 
> actually executing, for single-threaded workloads.
> What I looked at:
>  * Without fully understanding SEPExecutor.takeWorkPermit() – it answers 
> true to execute immediately, false if scheduled – scheduling seemed slow for 
> this workload.
>  ** takeWorkPermit() answers false if no work permits are available.
>  ** I noticed takeWorkPermit() also answers false if no task permits are 
> available, +even if no task permit need be taken.+
>  ** by changing this condition I was able to gain +45% performance _(a 
> paraphrased sketch follows this list)_.
>  * Without deeply understanding SEP Executor/ Worker or sleep algorithms, I 
> noted that in a single-thread workload SEPWorkers would likely spin & be put 
> to sleep for a period after completing each task.
>  ** I did wonder if the park -times- or parking behavior (empirically 
> observed at 10,000 - 20,000 nanos) could contribute to threads being more 
> aggressively de-scheduled.
>  ** an experiment in keeping 1 SEPWorker awake (not sleeping at all) gained 
> +7.9% performance.
>  ** _Note: initial ticket misread code as requesting 500,000 nanosecond 
> sleeps. This has now been corrected._
>  * Without deeply understanding SEP Executor/ Worker, I feel there may be 
> more questions around how SEP Workers are brought out of SPINNING/ sleep 
> state and whether this logic functions promptly & correctly.
>  ** At a very initial stage of investigation: +SEPWorker.assign() unparks 
> threads when transitioning out of STOPPED state, but code appears potentially 
> not to unpark threads coming out of SPINNING state.+
>  ** _This is a very cursory "looking at the code" & initial debugging stage, 
> but I'm not certain it's accurate._ _Attempted experiments to unpark for the 
> SPINNING -> Work transition so far_ _have_ _caused lockups of 100% machine 
> CPU use or dropped messages, rather than helping anything._
>  ** _If & when I can find out more, I'll post it here._
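>
> To make the takeWorkPermit() observation concrete, here is a paraphrased, 
> self-contained sketch of the logic as I read it – not the exact 3.11.10 
> source. The permit packing and helper names here are illustrative only:
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
>
> // Paraphrase of SEPExecutor.takeWorkPermit(boolean); the 32/32-bit permit
> // packing below is an illustrative stand-in, not the real implementation.
> class PermitsSketch
> {
>     final AtomicLong permits = new AtomicLong();
>
>     static int taskPermits(long v) { return (int) (v >>> 32); }
>     static int workPermits(long v) { return (int) v; }
>     static long combine(int tasks, int works)
>     {
>         return ((long) tasks << 32) | (works & 0xFFFFFFFFL);
>     }
>
>     // Answers true to execute immediately, false if the task is scheduled.
>     boolean takeWorkPermit(boolean takeTaskPermit)
>     {
>         while (true)
>         {
>             long current = permits.get();
>             int taskPermits = taskPermits(current);
>             int workPermits = workPermits(current);
>
>             // Behaviour I observed: answers false when taskPermits == 0,
>             // even when takeTaskPermit is false and no task permit is needed.
>             if (workPermits == 0 || taskPermits == 0)
>                 return false;
>             // Experimental change (gained +45% in my single-threaded runs):
>             // if (workPermits == 0 || (takeTaskPermit && taskPermits == 0))
>             //     return false;
>
>             int taskDelta = takeTaskPermit ? 1 : 0;
>             long next = combine(taskPermits - taskDelta, workPermits - 1);
>             if (permits.compareAndSet(current, next))
>                 return true;
>         }
>     }
> }
> {code}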
> I will post the tracing code & traces I captured, and welcome some feedback 
> and thoughts on these performance questions from the Cassandra dev community. 
> Thanks all!


