[jira] [Updated] (CASSANDRA-21345) avoid listing files in LogRecord when creating a new SSTableWriter for flush

Matt Byrd (Jira) Tue, 05 May 2026 07:48:08 -0700


     [ 
https://issues.apache.org/jira/browse/CASSANDRA-21345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matt Byrd updated CASSANDRA-21345:
----------------------------------
    Attachment: 
new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html
                
new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html

> avoid listing files in LogRecord when creating a new SSTableWriter for flush
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21345
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21345
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction, Local/Memtable, Local/Startup and 
> Shutdown
>            Reporter: Matt Byrd
>            Assignee: Matt Byrd
>            Priority: Normal
>         Attachments: flushwriter-stacktrace, 
> new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html,
>  
> new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html,
>  
> old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html,
>  
> old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html
>
>
> When flushing a memtable to disk, we create a LogRecord by attempting to list 
> the entire directory to see how many files match the theoretical prefix for 
> sstables we'll be creating.
> This is potentially quite expensive, e.g when using LCS and having O(100k) 
> files in a given directory, see attached stack-trace.
> It also doesn't seem strictly necessary since at flush time, no such sstable 
> component files should exist yet. The other point in time this method is 
> called is on remove, initially to just construct a LogRecord to locate the 
> initial ADD LogRecord (and the remove construction does another directory 
> listing independently).
> The file listing is used in a couple of ways for LogRecord construction, both 
> to get the count of components we expect, as well also the last updated 
> timestamp of said components.
> For ADD records we hardcode the last updated timestamp to 0 and the component 
> count is derived via effectively max(expected component count, any files 
> found on disk with prefix), see:
> {code:java}
>         return make(type, getExistingFiles(absoluteTablePath), 
> table.getAllFilePaths().size(), absoluteTablePath);
> {code}
> and then eventually the number of files found on disk via positive timestamp:
> {code:java}
>  LogRecord(type, absolutePath, lastModified, Math.max(minFiles, 
> positiveTSCount));
> {code}
> I'm interested in seeing if we can instead simply always using the expected 
> component count on flush (minFiles), this leads to deterministic behavior in 
> both call sites.
> We likely need to reason carefully about this change given potential 
> correctness concerns.
> However I'm optimistic that it may be feasible, since for one we already 
> assert right before creating the ADD record for flushing, that there aren't 
> already some components on disk:
> {code:java}
>         Set<Component> existingComponents = Sets.filter(components, c -> 
> descriptor.fileFor(c).exists());
>         assert existingComponents.isEmpty() : String.format("Cannot create a 
> new SSTable in directory %s as component files %s already exist there",
>                                                             
> descriptor.directory,
>                                                             
> existingComponents);
>         txn.trackNew(this);
> {code}
>  
> There are maybe some strange edge-cases to reason about where a bug or 
> operator error  results in placing random components that are somehow 
> disjoint or misnamed in the directory prior to the ADD record being created, 
> between it's creation and the search for it in the REMOVE record and 
> thereafter. 
> A lot of startup and recovery mechanism either ignore ADD files, or only make 
> use of their absolute path.
> If anyone has any inklings of problems which may arise or other approaches 
> please let me know.
> I suppose one compromise solution is to fall-back on potentially rather 
> listing the entire directory, just searching explicitly for any of the 
> possible components which could exist via e.g a handful of explicit stat sys 
> calls.
> I feel like this would be quite a lot more complicated for potentially little 
> to no benefit but we could opt to do that instead I we find some concrete 
> problems with just taking the component count.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (CASSANDRA-21345) avoid listing files in LogRecord when creating a new SSTableWriter for flush

Reply via email to