[
https://issues.apache.org/jira/browse/CASSANDRA-21345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matt Byrd updated CASSANDRA-21345:
----------------------------------
Attachment:
old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html
old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html
> avoid listing files in LogRecord when creating a new SSTableWriter for flush
> ----------------------------------------------------------------------------
>
> Key: CASSANDRA-21345
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21345
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Compaction, Local/Memtable, Local/Startup and
> Shutdown
> Reporter: Matt Byrd
> Assignee: Matt Byrd
> Priority: Normal
> Attachments: flushwriter-stacktrace,
> new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html,
>
> new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html,
>
> old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html,
>
> old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html
>
>
> When flushing a memtable to disk, we create a LogRecord by listing the
> entire directory to see how many files match the theoretical prefix for the
> sstables we'll be creating.
> This is potentially quite expensive, e.g. when using LCS with O(100k)
> files in a given directory; see the attached stack trace.
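To make the cost concrete, here is a minimal sketch (my own names, not the actual Cassandra code) of a prefix-matching directory listing: every entry in the directory is read just to count the few that share one sstable's prefix, so the work is O(total files) per flushed sstable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class PrefixCount
{
    // Counts directory entries whose names start with the given prefix.
    // Files.list must read every entry, so this is O(n) in directory size.
    static long countWithPrefix(Path dir, String prefix) throws IOException
    {
        try (Stream<Path> files = Files.list(dir))
        {
            return files.filter(p -> p.getFileName().toString().startsWith(prefix))
                        .count();
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path dir = Files.createTempDirectory("sstables");
        // Illustrative sstable-like component filenames.
        Files.createFile(dir.resolve("nb-1-big-Data.db"));
        Files.createFile(dir.resolve("nb-1-big-Index.db"));
        Files.createFile(dir.resolve("nb-2-big-Data.db")); // different generation
        System.out.println(countWithPrefix(dir, "nb-1-big-")); // prints 2
    }
}
```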
> It also doesn't seem strictly necessary, since at flush time no such sstable
> component files should exist yet. The other point in time this method is
> called is on remove, initially just to construct a LogRecord in order to
> locate the initial ADD LogRecord (and the remove construction does another
> directory listing independently).
> The file listing is used in a couple of ways during LogRecord construction:
> to get the count of components we expect, as well as the last-updated
> timestamp of those components.
> For ADD records we hardcode the last-updated timestamp to 0, and the
> component count is effectively derived as max(expected component count, files
> found on disk with the prefix), see:
> {code:java}
> return make(type, getExistingFiles(absoluteTablePath), table.getAllFilePaths().size(), absoluteTablePath);
> {code}
> and then eventually the number of files found on disk, via positive
> timestamps:
> {code:java}
> LogRecord(type, absolutePath, lastModified, Math.max(minFiles, positiveTSCount));
> {code}
> I'm interested in seeing if we can instead simply always use the expected
> component count (minFiles) on flush; this leads to deterministic behavior in
> both call sites.
> We likely need to reason carefully about this change, given potential
> correctness concerns.
> However, I'm optimistic that it may be feasible, since for one we already
> assert, right before creating the ADD record for flushing, that there aren't
> already components on disk:
> {code:java}
> Set<Component> existingComponents = Sets.filter(components, c -> descriptor.fileFor(c).exists());
> assert existingComponents.isEmpty() : String.format("Cannot create a new SSTable in directory %s as component files %s already exist there",
>                                                     descriptor.directory, existingComponents);
> txn.trackNew(this);
> {code}
>
> There are maybe some strange edge cases to reason about, where a bug or
> operator error places random components that are somehow disjoint or
> misnamed in the directory prior to the ADD record being created, or between
> its creation and the search for it in the REMOVE record, and thereafter.
> A lot of startup and recovery mechanisms either ignore ADD records or only
> make use of their absolute paths.
> If anyone has any inklings of problems which may arise or other approaches
> please let me know.
> I suppose one compromise solution is, rather than listing the entire
> directory, to fall back on explicitly checking for each of the possible
> components which could exist, e.g. via a handful of explicit stat syscalls.
> I feel like this would be quite a lot more complicated for potentially little
> to no benefit, but we could opt to do that instead if we find some concrete
> problems with just taking the expected component count.
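The compromise above could be sketched roughly like this (hypothetical code, not the Cassandra API): instead of listing the whole directory, stat only the bounded set of component files one sstable could have, making the cost O(#components) regardless of directory size.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ComponentProbe
{
    // For each candidate component name, issue a single existence check
    // (one stat per component) instead of listing the directory.
    static long existingComponentCount(Path dir, String prefix, List<String> componentNames)
    {
        return componentNames.stream()
                             .filter(c -> Files.exists(dir.resolve(prefix + c)))
                             .count();
    }

    public static void main(String[] args) throws IOException
    {
        Path dir = Files.createTempDirectory("probe");
        // Only one of the two candidate components actually exists.
        Files.createFile(dir.resolve("nb-1-big-Data.db"));
        System.out.println(existingComponentCount(dir, "nb-1-big-",
                                                  List.of("Data.db", "Index.db"))); // prints 1
    }
}
```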
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]