[
https://issues.apache.org/jira/browse/CASSANDRA-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914802#comment-17914802
]
Dmitry Konstantinov commented on CASSANDRA-20176:
-------------------------------------------------
I have run some initial tests to compare SEP vs plain executors on a CPU-bound
workload:
* [https://github.com/netudima/cassandra/tree/sep_benefits_testing-5.0] - the
logic to disable SEP and use plain executors instead
* single-node cluster as a first step (I understand this is a synthetic load
and at least 3 nodes would be more realistic)
* write-only workload (small inserts) generated using cassandra-stress:
{code:java}
./tools/bin/cassandra-stress "write n=10m" -rate threads=100 -node somenode
{code}
* OpenJDK jdk-17.0.12+7
* Linux kernel: 4.18.0-240.el8.x86_64 (quite an old version, but it is what I
have available at the moment)
* CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
* RAM: 46GiB => 8GiB Xmx
* commit log enabled
* compaction disabled (to simplify analysis)
h2. Disabled SEP results:
{code:java}
Results:
Op rate : 95,942 op/s [WRITE: 95,942 op/s]
Partition rate : 95,942 pk/s [WRITE: 95,942 pk/s]
Row rate : 95,942 row/s [WRITE: 95,942 row/s]
Latency mean : 1.0 ms [WRITE: 1.0 ms]
Latency median : 0.8 ms [WRITE: 0.8 ms]
Latency 95th percentile : 2.1 ms [WRITE: 2.1 ms]
Latency 99th percentile : 6.9 ms [WRITE: 6.9 ms]
Latency 99.9th percentile : 14.1 ms [WRITE: 14.1 ms]
Latency max : 91.6 ms [WRITE: 91.6 ms]
Total partitions : 10,000,000 [WRITE: 10,000,000]
Total errors : 0 [WRITE: 0]
Total GC count : 26
Total GC memory : 96.479 GiB
Total GC time : 1.4 seconds
Avg GC time : 53.0 ms
StdDev GC time : 13.7 ms
Total operation time : 00:01:44
{code}
Increasing the number of client threads to 300 and raising the limits on the
amount of memory for in-progress requests:
{code:java}
native_transport_max_request_data_in_flight_per_ip: 1024MiB
native_transport_max_request_data_in_flight: 1024MiB
{code}
allows squeezing out only a bit more:
{code:java}
Results:
Op rate : 111,402 op/s [WRITE: 111,402 op/s]
Partition rate : 111,402 pk/s [WRITE: 111,402 pk/s]
Row rate : 111,402 row/s [WRITE: 111,402 row/s]
Latency mean : 2.6 ms [WRITE: 2.6 ms]
Latency median : 2.6 ms [WRITE: 2.6 ms]
Latency 95th percentile : 3.8 ms [WRITE: 3.8 ms]
Latency 99th percentile : 7.0 ms [WRITE: 7.0 ms]
Latency 99.9th percentile : 33.5 ms [WRITE: 33.5 ms]
Latency max : 220.1 ms [WRITE: 220.1 ms]
Total partitions : 10,000,000 [WRITE: 10,000,000]
Total errors : 0 [WRITE: 0]
Total GC count : 40
Total GC memory : 98.016 GiB
Total GC time : 1.5 seconds
Avg GC time : 37.8 ms
StdDev GC time : 11.2 ms
Total operation time : 00:01:29
{code}
Note: CPU usage on the cassandra-stress host is ~50%, so the client is not a bottleneck.
vmstat output for Cassandra server host:
{code:java}
vmstat -w 1
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b swpd     free  buff     cache si so  bi     bo     in     cs us sy id wa st
39  0    0  2773720  1044  34560000  0  0 763    325      0      0  7  3 89  0  0
14  0    0  2718288  1044  34590072  0  0   0  29628 298902 338704 31 19 50  0  0
14  0    0  2652708  1044  34617108  0  0   0   3050 237005 323891 35 17 48  0  0
15  0    0  2477688  1044  34665624  0  0   0  33114 283984 303530 47 19 33  0  0
 9  0    0  2315192  1044  34750520  0  0   0  32451 264911 283519 49 17 35  0  0
 9  0    0  2183728  1044  34841084  0  0   0  33239 298299 318187 40 19 42  0  0
 6  0    0  2062132  1044  34932676  0  0   0  31739 297409 326448 38 19 43  0  0
18  0    0  1928896  1044  35019408  0  0   0  14340 261117 316019 38 18 44  0  0
15  0    0  1867232  1044  35111216  0  0   0  19203 279067 335133 38 19 43  0  0
17  0    0  1756780  1044  35197336  0  0   0  32711 299355 316191 39 18 43  0  0
24  1    0  1648816  1044  35264260  0  0   0 460065 287125 320040 35 21 43  1  0
29  0    0  2026240  1044  34902500  0  0   0  33083 307611 348982 33 20 48  0  0
20  0    0  1996572  1044  34930312  0  0   0  32963 292780 326634 34 18 48  0  0
24  0    0  1965136  1044  34957816  0  0   0  32263 300858 336901 31 20 49  0  0
10  0    0  1933384  1044  34988568  0  0   0  33019 302443 348119 34 19 47  0  0
{code}
sjk ttop output: [^ttop_disabled_sep.txt]
h2. Enabled SEP results:
Note: the default native_transport_max_request_data_in_flight settings and 100
cassandra-stress threads are used.
{code:java}
Results:
Op rate : 165,720 op/s [WRITE: 165,720 op/s]
Partition rate : 165,720 pk/s [WRITE: 165,720 pk/s]
Row rate : 165,720 row/s [WRITE: 165,720 row/s]
Latency mean : 0.6 ms [WRITE: 0.6 ms]
Latency median : 0.5 ms [WRITE: 0.5 ms]
Latency 95th percentile : 0.9 ms [WRITE: 0.9 ms]
Latency 99th percentile : 1.3 ms [WRITE: 1.3 ms]
Latency 99.9th percentile : 9.2 ms [WRITE: 9.2 ms]
Latency max : 180.7 ms [WRITE: 180.7 ms]
Total partitions : 10,000,000 [WRITE: 10,000,000]
Total errors : 0 [WRITE: 0]
Total GC count : 26
Total GC memory : 90.507 GiB
Total GC time : 1.5 seconds
Avg GC time : 58.1 ms
StdDev GC time : 19.3 ms
Total operation time : 00:01:00
{code}
vmstat output for Cassandra server host:
{code:java}
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b swpd     free  buff     cache si so bi     bo     in     cs us sy id wa st
 2  0    0  3566380  1044  33307176  0  0  0  64941 417372 287103 55 23 22  0  0
22  0    0  3450476  1044  33403660  0  0  0  29021 288822 277355 56 20 24  0  0
21  0    0  3325224  1044  33503448  0  0  0  37780 348275 297448 55 22 23  0  0
20  1    0  3207672  1044  33585812  0  0  0 462792 338309 292668 52 24 24  1  0
25  0    0  3799976  1044  33166264  0  0  0  60360 366850 273934 50 22 28  0  0
24  0    0  3764116  1044  33211860  0  0 32  34136 344039 303024 56 22 22  0  0
20  0    0  3677188  1044  33313384  0  0  0  32604 328272 290207 55 22 23  0  0
23  0    0  3587144  1044  33404720  0  0  0  65591 399538 265509 55 22 24  0  0
20  0    0  3457116  1044  33510808  0  0  0  33238 331451 290611 57 22 21  0  0
16  0    0  3326136  1044  33614532  0  0  0  32619 328778 292760 55 22 23  0  0
10  0    0  3187764  1044  33712116  0  0  0  48114 356026 269176 56 22 22  0  0
22  0    0  3050616  1044  33814616  0  0  0  48898 356712 288904 55 23 22  0  0
21  1    0  2906848  1044  33916928  0  0  0  33926 325947 289436 55 22 22  0  0
23  2    0  3258484  1044  33553496  0  0  0 459752 315801 284429 52 23 24  0  0
17  0    0  3213116  1044  33598508  0  0  0  64779 424400 310141 52 24 24  0  0
15  0    0  3168688  1044  33643492  0  0  0  33616 346234 308905 53 23 24  0  0
{code}
sjk ttop output: [^ttop_enabled_sep.txt]
So, in these initial tests the non-SEP option gives worse latency and
throughput values.
In the non-SEP case the CPU load is distributed unevenly between the threads
within a thread pool, and a smaller number of threads is involved in the
processing, compared to SEP, where the load is very balanced. In the non-SEP
case we +probably+ get higher contention on the thread pool blocking queues.
Maybe reducing the number of threads compared to the default could help the
non-SEP option, but I have not checked it yet.
With the non-SEP option we are stuck in the 50-60% CPU usage range, even with
an increased number of stress client threads and increased
native_transport_max_request_data_in_flight limits.
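For context, the non-SEP variant boils down to a plain JDK thread pool where every worker polls one shared blocking queue, which is the suspected contention point. A minimal sketch of that setup (illustrative only, not the actual code from the branch):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PlainExecutorSketch {
    public static void main(String[] args) throws Exception {
        // All workers take tasks from a single LinkedBlockingQueue, so every
        // submit/take pair touches the same queue; SEP instead hands work
        // directly to a spinning worker and avoids this shared hot spot.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());

        AtomicLong done = new AtomicLong();
        int tasks = 100_000;
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::incrementAndGet); // every task goes through the shared queue
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("completed=" + done.get());
    }
}
```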
> Reduce memory allocation in SEP Worker spin wait logic
> ------------------------------------------------------
>
> Key: CASSANDRA-20176
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20176
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Other
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Attachments: image-2025-01-01-13-14-02-562.png,
> image-2025-01-01-13-15-16-767.png, ttop_disabled_sep.txt, ttop_enabled_sep.txt
>
>
> There is visible memory allocation within the spin-wait logic in the SEP
> Executor (org.apache.cassandra.concurrent.SEPWorker#doWaitSpin) for some
> workloads. For example, it is observed for the write test described in
> CASSANDRA-20165, where ~8.5% of total allocations come from this logic:
> !image-2025-01-01-13-14-02-562.png|width=570!
> !image-2025-01-01-13-15-16-767.png|width=570!
> The idea of this parking is to avoid unpark signalling costs. The logic
> selects a random time period to park the thread via LockSupport.parkNanos and
> puts the thread into a ConcurrentSkipListMap using the wake-up time as a key,
> so the map is used as a concurrent priority queue. Once the parking is
> finished, the thread removes itself from the map. When we need to schedule a
> task, we take a spinning thread with the smallest wake-up time from the map.
> We can try to implement another algorithm for this logic without memory
> allocation overheads, for example based on a Timing Wheel data structure.
> Note: it also makes sense to check the granularity of the actual parking time
> (https://hazelcast.com/blog/locksupport-parknanos-under-the-hood-and-the-curious-case-of-parking/)
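The quoted spin-wait pattern can be sketched roughly as follows (class and method names are illustrative, not the actual SEPWorker code). The allocation overhead comes from the map entries (and boxed Long keys) created on every spin cycle:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.LockSupport;

public class SpinParkSketch {
    // Wake-up deadline (nanos) -> parked thread; the sorted map acts as a
    // concurrent priority queue ordered by wake-up time.
    static final ConcurrentSkipListMap<Long, Thread> spinning = new ConcurrentSkipListMap<>();

    static void doWaitSpin() {
        long sleep = ThreadLocalRandom.current().nextLong(1_000_000); // random park period
        long wakeAt = System.nanoTime() + sleep;
        spinning.put(wakeAt, Thread.currentThread()); // allocates a new map entry each cycle
        LockSupport.parkNanos(sleep);                 // may return early if unparked
        spinning.remove(wakeAt);                      // remove ourselves after parking
    }

    static Thread pollSoonestSpinner() {
        // To schedule a task, take the spinner with the smallest wake-up time.
        Map.Entry<Long, Thread> e = spinning.pollFirstEntry();
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) throws Exception {
        Thread t = new Thread(SpinParkSketch::doWaitSpin);
        t.start();
        t.join();
        System.out.println("map empty after spin: " + spinning.isEmpty());
    }
}
```

A timing wheel would replace the sorted map with a fixed array of buckets indexed by deadline, avoiding per-cycle entry allocation at the cost of coarser wake-up ordering.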