[ https://issues.apache.org/jira/browse/CASSANDRA-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914802#comment-17914802 ]

Dmitry Konstantinov commented on CASSANDRA-20176:
-------------------------------------------------

I have run some initial tests to compare SEP vs plain executors on a CPU-bound 
workload:
 * [https://github.com/netudima/cassandra/tree/sep_benefits_testing-5.0] - the 
logic to disable SEP and use plain executors instead
 * a single-node cluster as a first step (I understand that this is a synthetic 
load and at least 3 nodes would be more realistic)
 * a write-only workload (small inserts) is generated using cassandra-stress:
{code:java}
./tools/bin/cassandra-stress "write n=10m" -rate threads=100  -node somenode
{code}

 * OpenJDK jdk-17.0.12+7
 * Linux kernel: 4.18.0-240.el8.x86_64 (quite an old version, but it is what I 
have available at the moment)
 * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
 * RAM: 46GiB => 8GiB Xmx
 * commit log enabled
 * compaction disabled (to simplify analysis)

h2. Disabled SEP results:
{code:java}
Results:
Op rate                   :   95,942 op/s  [WRITE: 95,942 op/s]
Partition rate            :   95,942 pk/s  [WRITE: 95,942 pk/s]
Row rate                  :   95,942 row/s [WRITE: 95,942 row/s]
Latency mean              :    1.0 ms [WRITE: 1.0 ms]
Latency median            :    0.8 ms [WRITE: 0.8 ms]
Latency 95th percentile   :    2.1 ms [WRITE: 2.1 ms]
Latency 99th percentile   :    6.9 ms [WRITE: 6.9 ms]
Latency 99.9th percentile :   14.1 ms [WRITE: 14.1 ms]
Latency max               :   91.6 ms [WRITE: 91.6 ms]
Total partitions          : 10,000,000 [WRITE: 10,000,000]
Total errors              :          0 [WRITE: 0]
Total GC count            : 26
Total GC memory           : 96.479 GiB
Total GC time             :    1.4 seconds
Avg GC time               :   53.0 ms
StdDev GC time            :   13.7 ms
Total operation time      : 00:01:44
{code}
Increasing the number of client threads to 300 and raising the limits on 
in-flight request memory:
{code:java}
native_transport_max_request_data_in_flight_per_ip: 1024MiB
native_transport_max_request_data_in_flight: 1024MiB
{code}
allows squeezing out only a bit more:
{code:java}
Results:
Op rate                   :  111,402 op/s  [WRITE: 111,402 op/s]
Partition rate            :  111,402 pk/s  [WRITE: 111,402 pk/s]
Row rate                  :  111,402 row/s [WRITE: 111,402 row/s]
Latency mean              :    2.6 ms [WRITE: 2.6 ms]
Latency median            :    2.6 ms [WRITE: 2.6 ms]
Latency 95th percentile   :    3.8 ms [WRITE: 3.8 ms]
Latency 99th percentile   :    7.0 ms [WRITE: 7.0 ms]
Latency 99.9th percentile :   33.5 ms [WRITE: 33.5 ms]
Latency max               :  220.1 ms [WRITE: 220.1 ms]
Total partitions          : 10,000,000 [WRITE: 10,000,000]
Total errors              :          0 [WRITE: 0]
Total GC count            : 40
Total GC memory           : 98.016 GiB
Total GC time             :    1.5 seconds
Avg GC time               :   37.8 ms
StdDev GC time            :   11.2 ms
Total operation time      : 00:01:29
{code}
note: CPU usage on the cassandra-stress host is ~50%; the client is not a bottleneck

vmstat output for Cassandra server host:
{code:java}
vmstat -w 1
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
39  0            0      2773720         1044     34560000    0    0   763   325    0    0   7   3  89   0   0
14  0            0      2718288         1044     34590072    0    0     0 29628 298902 338704  31  19  50   0   0
14  0            0      2652708         1044     34617108    0    0     0  3050 237005 323891  35  17  48   0   0
15  0            0      2477688         1044     34665624    0    0     0 33114 283984 303530  47  19  33   0   0
 9  0            0      2315192         1044     34750520    0    0     0 32451 264911 283519  49  17  35   0   0
 9  0            0      2183728         1044     34841084    0    0     0 33239 298299 318187  40  19  42   0   0
 6  0            0      2062132         1044     34932676    0    0     0 31739 297409 326448  38  19  43   0   0
18  0            0      1928896         1044     35019408    0    0     0 14340 261117 316019  38  18  44   0   0
15  0            0      1867232         1044     35111216    0    0     0 19203 279067 335133  38  19  43   0   0
17  0            0      1756780         1044     35197336    0    0     0 32711 299355 316191  39  18  43   0   0
24  1            0      1648816         1044     35264260    0    0     0 460065 287125 320040  35  21  43   1   0
29  0            0      2026240         1044     34902500    0    0     0 33083 307611 348982  33  20  48   0   0
20  0            0      1996572         1044     34930312    0    0     0 32963 292780 326634  34  18  48   0   0
24  0            0      1965136         1044     34957816    0    0     0 32263 300858 336901  31  20  49   0   0
10  0            0      1933384         1044     34988568    0    0     0 33019 302443 348119  34  19  47   0   0
{code}
sjk ttop output: [^ttop_disabled_sep.txt]
h2. Enabled SEP results

Note: the default native_transport_max_request_data_in_flight settings and 100 
cassandra-stress threads are used.
{code:java}
Results:
Op rate                   :  165,720 op/s  [WRITE: 165,720 op/s]
Partition rate            :  165,720 pk/s  [WRITE: 165,720 pk/s]
Row rate                  :  165,720 row/s [WRITE: 165,720 row/s]
Latency mean              :    0.6 ms [WRITE: 0.6 ms]
Latency median            :    0.5 ms [WRITE: 0.5 ms]
Latency 95th percentile   :    0.9 ms [WRITE: 0.9 ms]
Latency 99th percentile   :    1.3 ms [WRITE: 1.3 ms]
Latency 99.9th percentile :    9.2 ms [WRITE: 9.2 ms]
Latency max               :  180.7 ms [WRITE: 180.7 ms]
Total partitions          : 10,000,000 [WRITE: 10,000,000]
Total errors              :          0 [WRITE: 0]
Total GC count            : 26
Total GC memory           : 90.507 GiB
Total GC time             :    1.5 seconds
Avg GC time               :   58.1 ms
StdDev GC time            :   19.3 ms
Total operation time      : 00:01:00
{code}
vmstat output for Cassandra server host:
{code:java}
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
 2  0            0      3566380         1044     33307176    0    0     0 64941 417372 287103  55  23  22   0   0
22  0            0      3450476         1044     33403660    0    0     0 29021 288822 277355  56  20  24   0   0
21  0            0      3325224         1044     33503448    0    0     0 37780 348275 297448  55  22  23   0   0
20  1            0      3207672         1044     33585812    0    0     0 462792 338309 292668  52  24  24   1   0
25  0            0      3799976         1044     33166264    0    0     0 60360 366850 273934  50  22  28   0   0
24  0            0      3764116         1044     33211860    0    0    32 34136 344039 303024  56  22  22   0   0
20  0            0      3677188         1044     33313384    0    0     0 32604 328272 290207  55  22  23   0   0
23  0            0      3587144         1044     33404720    0    0     0 65591 399538 265509  55  22  24   0   0
20  0            0      3457116         1044     33510808    0    0     0 33238 331451 290611  57  22  21   0   0
16  0            0      3326136         1044     33614532    0    0     0 32619 328778 292760  55  22  23   0   0
10  0            0      3187764         1044     33712116    0    0     0 48114 356026 269176  56  22  22   0   0
22  0            0      3050616         1044     33814616    0    0     0 48898 356712 288904  55  23  22   0   0
21  1            0      2906848         1044     33916928    0    0     0 33926 325947 289436  55  22  22   0   0
23  2            0      3258484         1044     33553496    0    0     0 459752 315801 284429  52  23  24   0   0
17  0            0      3213116         1044     33598508    0    0     0 64779 424400 310141  52  24  24   0   0
15  0            0      3168688         1044     33643492    0    0     0 33616 346234 308905  53  23  24   0   0
{code}
sjk ttop output: [^ttop_enabled_sep.txt]

So, in these initial tests the non-SEP option gives worse latency and 
throughput than SEP.

In the SEP case, CPU load is distributed unevenly between the threads within a 
thread pool, and a smaller number of threads is involved in the processing; in 
the non-SEP case the load is very evenly balanced. With non-SEP we +probably+ 
get higher contention on the thread pool blocking queues. Maybe reducing the 
number of threads for the non-SEP option compared to the default could help, 
but I have not checked it yet.

With the non-SEP option we are stuck in the 50-60% CPU usage range, even with 
an increased number of stress client threads and increased 
native_transport_max_request_data_in_flight limits.
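
The uneven-vs-balanced distribution is easy to observe directly. Below is a 
minimal sketch (not Cassandra code; the class and method names are mine) that 
counts how many tasks each worker of a plain fixed pool on a shared blocking 
queue ends up executing, which is roughly the non-SEP setup being compared:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical mini-benchmark: submit many tiny tasks to a plain fixed-size
// pool (workers all take() from one shared LinkedBlockingQueue) and record a
// per-thread task count to see how evenly the work spreads across workers.
public class PoolDistribution {
    public static ConcurrentMap<String, LongAdder> run(int threads, int tasks)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        ConcurrentMap<String, LongAdder> perThread = new ConcurrentHashMap<>();
        for (int i = 0; i < tasks; i++) {
            // Each task just increments the counter for the worker that ran it.
            pool.execute(() -> perThread
                    .computeIfAbsent(Thread.currentThread().getName(), k -> new LongAdder())
                    .increment());
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return perThread;
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentMap<String, LongAdder> counts = run(8, 100_000);
        counts.forEach((name, n) -> System.out.println(name + ": " + n.sum()));
    }
}
```

With tasks this cheap, the per-thread counts also expose the queue handoff as 
the dominant cost, which is consistent with the contention hypothesis above.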

> Reduce memory allocation in SEP Worker spin wait logic
> ------------------------------------------------------
>
>                 Key: CASSANDRA-20176
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20176
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Other
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>         Attachments: image-2025-01-01-13-14-02-562.png, 
> image-2025-01-01-13-15-16-767.png, ttop_disabled_sep.txt, ttop_enabled_sep.txt
>
>
> There is a visible memory allocation within spin waiting logic in SEP 
> Executor: org.apache.cassandra.concurrent.SEPWorker#doWaitSpin for some 
> workloads. For example it is observed for a writing test described in 
> CASSANDRA-20165 where ~8.5% of total allocations are from this logic:
> !image-2025-01-01-13-14-02-562.png|width=570!
> !image-2025-01-01-13-15-16-767.png|width=570!
> The idea of this parking is to avoid unpark signalling costs. The logic 
> selects a random time period to park a thread via LockSupport.parkNanos and 
> puts the thread into a ConcurrentSkipListMap using the wake-up time as a key, 
> so the map is used as a concurrent priority queue. Once the parking is 
> finished, the thread removes itself from the map. When we need to schedule a 
> task, we take the spinning thread with the smallest wake-up time from the map.
> We can try to implement another algorithm for this logic without memory 
> allocation overheads, for example based on a Timing Wheel data structure.
> Note: it also makes sense to check granularity of actual parking time 
> (https://hazelcast.com/blog/locksupport-parknanos-under-the-hood-and-the-curious-case-of-parking/)
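
The parking scheme described in the quoted issue can be sketched roughly as 
follows (illustrative names, not Cassandra's actual SEPWorker API; the real 
implementation also has to handle deadline collisions and races that this 
sketch ignores). The boxed Long key and skip-list node created on every 
spin iteration are the allocations the issue is about:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.LockSupport;

// Sketch of spin-wait parking: idle workers park for a random interval and
// register themselves in a ConcurrentSkipListMap keyed by wake-up deadline,
// so the map acts as a concurrent priority queue ordered by wake-up time.
public class SpinRegistry {
    private final ConcurrentSkipListMap<Long, Thread> spinning = new ConcurrentSkipListMap<>();

    /** Called by an idle worker: park for a random slice, registered under its deadline. */
    public void spinOnce(long maxParkNanos) {
        long park = ThreadLocalRandom.current().nextLong(1, maxParkNanos);
        long deadline = System.nanoTime() + park;
        spinning.put(deadline, Thread.currentThread()); // allocates boxed key + map node
        LockSupport.parkNanos(park);
        spinning.remove(deadline);                      // deregister after waking
    }

    /** Called by a producer with work: unpark the spinner due to wake up soonest. */
    public boolean wakeEarliest() {
        Map.Entry<Long, Thread> e = spinning.pollFirstEntry();
        if (e == null) return false;                    // nobody is spinning
        LockSupport.unpark(e.getValue());
        return true;
    }
}
```

A timing-wheel replacement, as proposed, would bucket deadlines into a fixed 
ring of slots, trading exact ordering for allocation-free registration.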



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
