[ 
https://issues.apache.org/jira/browse/CASSANDRA-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830306#comment-17830306
 ] 

Jon Haddad commented on CASSANDRA-19477:
----------------------------------------

I've set up a 3-node cluster, loaded 15GB of data, then took down a node and let 
hints accumulate.  I switched one node to use the 4.1 patch branch above, and 
let the other node remain on release 4.1, then ran this:
{noformat}
easy-cass-stress run RandomPartitionAccess --workload.rows=1000 --rate 5k -d 2h -t 4
{noformat}
Here's the 4.1 release flame graph.  
[^flame-cassandra0-release-2024-03-25_00-16-44.html]

StorageProxy.mutate is taking up 17% of CPU time, with shouldHint taking up 
almost 7% of CPU time.

Here's the 4.1 + patch flame graph: 
[^flame-cassandra0-patched-2024-03-25_00-40-47.html]

StorageProxy.mutate is only taking up 10% of CPU time now, with shouldHint 
taking up 0.26% of CPU time.

You can see in the graph below that 172.31.36.176 is using less CPU overall.

!image-2024-03-24-17-57-32-560.png|width=857,height=270!

Here's the same setup with additional load.
{noformat}
easy-cass-stress run RandomPartitionAccess --workload.rows=1000 --rate 30k -d 2h -t 4
{noformat}
!image-2024-03-24-18-08-36-918.png|width=749,height=302!

The improvement in this patch is fantastic; really nice work [~smiklosovic].  
I'm +1 with regard to performance, but deferring to [~aleksey] to judge 
correctness.

> Do not go to disk to get HintsStore.getTotalFileSize
> ----------------------------------------------------
>
>                 Key: CASSANDRA-19477
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19477
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Hints
>            Reporter: Jon Haddad
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 4.1.x, 5.0-rc, 5.x
>
>         Attachments: flame-cassandra0-patched-2024-03-25_00-40-47.html, 
> flame-cassandra0-release-2024-03-25_00-16-44.html, flamegraph.cpu.html, 
> image-2024-03-24-17-57-32-560.png, image-2024-03-24-18-08-36-918.png
>
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> When testing a cluster with more requests than it could handle, I noticed 
> significant CPU time (25%) spent in HintsStore.getTotalFileSize.  Here's what 
> I'm seeing from profiling:
> 10% of CPU time is spent in HintsDescriptor.fileName, which only does this:
>  
> {noformat}
> return String.format("%s-%s-%s.hints", hostId, timestamp, version);
> {noformat}
> At a bare minimum here we should create this string up front with the host 
> and version and eliminate 2 of the 3 substitutions, but I think it's probably 
> faster to use a StringBuilder and avoid the underlying regular expression 
> altogether.
> 12% of the time is spent in org.apache.cassandra.io.util.File.length.  It 
> looks like this is called once for each hint file on disk for each host we're 
> hinting to.  In the case of an overloaded cluster, this is significant.  It 
> would be better if we were to track the file size in memory for each hint 
> file and reference that rather than go to the filesystem.
> These fairly small changes should make Cassandra more reliable when under 
> load spikes.
> CPU Flame graph attached.
> I only tested this in 4.1 but it looks like this is present up to trunk.
>  
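
To illustrate the first suggestion in the description (building the file name 
without String.format), here is a minimal sketch. The class 
HintFileNameSketch and its fields are hypothetical stand-ins, not the actual 
HintsDescriptor code or the patch under review; it only assumes the 
"%s-%s-%s.hints" layout quoted above.

```java
import java.util.UUID;

// Hypothetical stand-in for HintsDescriptor; NOT the real Cassandra class.
final class HintFileNameSketch {
    private final String prefix;   // "<hostId>-", computed once
    private final String suffix;   // "-<version>.hints", computed once
    private final long timestamp;

    HintFileNameSketch(UUID hostId, long timestamp, int version) {
        this.prefix = hostId + "-";
        this.suffix = "-" + version + ".hints";
        this.timestamp = timestamp;
    }

    // Plain appends; no format string has to be parsed on each call.
    String fileName() {
        return new StringBuilder(prefix.length() + 20 + suffix.length())
            .append(prefix)
            .append(timestamp)
            .append(suffix)
            .toString();
    }

    // The String.format form quoted in the description, kept for comparison.
    static String viaFormat(UUID hostId, long timestamp, int version) {
        return String.format("%s-%s-%s.hints", hostId, timestamp, version);
    }
}
```

Since all three inputs are fixed per descriptor in this sketch, the whole name 
could also be computed once in the constructor and cached in a field.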

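The second suggestion (tracking hint file sizes in memory rather than calling 
File.length per file) could look roughly like the sketch below. 
HintSizeTrackerSketch and its method names are assumptions made up for 
illustration; they are not Cassandra APIs and do not describe how the patch 
is implemented.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical bookkeeping for one hints store; NOT the real HintsStore.
final class HintSizeTrackerSketch {
    private final ConcurrentHashMap<String, Long> sizeByFile = new ConcurrentHashMap<>();
    private final AtomicLong total = new AtomicLong();

    // Called by the writer after it appends bytes to a hint file.
    void onBytesWritten(String fileName, long bytes) {
        sizeByFile.merge(fileName, bytes, Long::sum);
        total.addAndGet(bytes);
    }

    // Called when a hint file is dispatched and deleted.
    void onFileDeleted(String fileName) {
        Long removed = sizeByFile.remove(fileName);
        if (removed != null) {
            total.addAndGet(-removed);
        }
    }

    // A single in-memory read; no filesystem call on the write path.
    // The map and the total are updated separately, so a concurrent
    // reader may briefly see a slightly stale total.
    long totalFileSize() {
        return total.get();
    }
}
```

The trade-off in this sketch is that the counters have to be kept in step with 
every write and delete, and would need to be seeded from disk once at startup.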

