[jira] [Commented] (CASSANDRA-21345) avoid listing files in LogRecord when creating a new SSTableWriter for flush

Matt Byrd (Jira) Tue, 02 Jun 2026 15:54:19 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-21345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085629#comment-18085629
 ]


Matt Byrd commented on CASSANDRA-21345:
---------------------------------------

another one  [^mbyrd_CASSANDRA-21345_another_ci_summary.html]  mostly accord 
things.
There seems to be fairly small overlap between them.

simulator.test.ShortPaxosSimulationTest::selfReconcileTest
simulator.test.ShortAccordSimulationTest::simulationTest-_jdk11
replace_address_test.TestReplaceAddress::test_replace_with_insufficient_replicas
client_request_metrics_test.TestClientRequestMetrics::test_client_request_metrics
bootstrap_test.TestBootstrap::test_consistent_range_movement_true_with_rf1_should_fail
bootstrap_test.TestBootstrap::test_consistent_range_movement_false_with_rf1_should_succeed


some automated analysis on those that are failing repeatedly:



  ---
  1. 
bootstrap_test.TestBootstrap::test_consistent_range_movement_false_with_rf1_should_succeed
 (python-dtests-latest)
  - CCM timeout: node3 never printed "Starting listening for CQL clients" 
within 90s. Pure infrastructure/environment flake — the node just didn't boot 
in time.

  ---
  2. 
bootstrap_test.TestBootstrap::test_consistent_range_movement_true_with_rf1_should_fail
 (python-dtests-latest)
  - CCM timeout, but different character: waited 600s for "Necessary replicas 
for strict consistency were removed by source filters" in system.log, never 
appeared. The log tail shows gossip messages timing out. The test is 
specifically checking for a log message that's supposed to be emitted when a 
consistent range movement is correctly rejected — if that
  message is no longer being logged, this could be a real behavioral regression 
on trunk (not your patch, but worth noting).

  ---
  3. 
client_request_metrics_test.TestClientRequestMetrics::test_client_request_metrics
 (python-dtests-latest)
  - KeyError: 'ContentionHistogram.Count' at 
client_request_metrics_test.py:382. The test retries 10 times in 
cas_contention() looking for this metric to appear in the diff between two 
CASClientRequestMetrics snapshots — it never does. This looks like a real 
missing metric: CASRead.ContentionHistogram is either not being tracked, not 
registered, or the name
  changed.

  ---
  4. 
replace_address_test.TestReplaceAddress::test_replace_with_insufficient_replicas
 (python-dtests-latest)
  - IllegalArgumentException: Multiple entries with same key in 
ReplicaGroups.Builder.build() (ReplicaGroups.java:440) → 
ApplyPlacementDeltas.execute() → LocalLog.processPendingInternal(). The two 
conflicting keys are:
    - (-9223372036854775808,-9064326250261401314] (epoch 15, node4/127.0.0.4 
being registered for replacement)
    - (-9223372036854775808,-9061276015282936864] (epoch 8, existing 3-node 
replica group)
  The replacement node's token splits an existing range, and 
PlacementDeltas.apply() ends up inserting both into the ImmutableSortedMap — 
they share the same start key so Guava's conflict check fires. This is a real 
bug in the placement delta application during node replacement.

  ---
  5. simulator.test.ShortAccordSimulationTest::simulationTest-_jdk11 
(jvm-dtest-simulator)
  - ClassCastException on seed 0xc83faa2defcf5f56: 
accord.coordinate.CoordinateSyncPoint$1 (loaded by InstanceClassLoader 
@506dcf55) cannot be cast to accord.utils.async.AsyncChain$Head (loaded by the 
app classloader). This is a same-class-different-classloader problem in the 
Accord simulator — CoordinateSyncPoint$1 leaks across classloader boundaries 
during
  Bootstrap.restart(). The flatMapResult → AsyncChains$AbstractChain.then() 
call at AsyncChains.java:523 does the cast.

  ---
  6. simulator.test.ShortPaxosSimulationTest::selfReconcileTest 
(jvm-dtest-simulator)
  - Bare JUnit timeout with no application stack trace. No useful diagnostic 
information available from this failure — just "Timeout occurred". Likely a 
slow/hung test scenario in the simulator.

  ---
  TL;DR: #1 is env noise. #6 is uninformative. The three most substantive ones 
are:
  - #4 (replace_with_insufficient_replicas) — looks like a real ReplicaGroups 
bug in the TCM placement layer, reproducible with node replacement
  - #5 (ShortAccordSimulationTest) — classloader isolation bug in the Accord 
simulator bootstrap path
  - #3 (test_client_request_metrics) — CASRead.ContentionHistogram metric 
appears missing

> avoid listing files in LogRecord when creating a new SSTableWriter for flush
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21345
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21345
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction, Local/Memtable, Local/Startup and 
> Shutdown
>            Reporter: Matt Byrd
>            Assignee: Matt Byrd
>            Priority: Normal
>         Attachments: ci_summary_21345_trunk.html, 
> ci_summary_CASSANDRA-21345_trunk.html, flushwriter-stacktrace, 
> mbyrd_CASSANDRA-21345_another_ci_summary.html, 
> mbyrd_CASSANDRA-21345_ci_summary.html, 
> new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html,
>  
> new-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html,
>  
> old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-false-flame-cpu-forward.html,
>  
> old-org.apache.cassandra.test.microbench.sstable.MemtableFlushBench.doFlushing-AverageTime-extraFileCount-200000-flushingSStableCount-10-preFillSStableCount-10-rowCount-1-skipCleanup-true-flame-cpu-forward.html
>
>
> When flushing a memtable to disk, we create a LogRecord by attempting to list 
> the entire directory to see how many files match the theoretical prefix for 
> sstables we'll be creating.
> This is potentially quite expensive, e.g when using LCS and having O(100k) 
> files in a given directory, see attached stack-trace.
> It also doesn't seem strictly necessary since at flush time, no such sstable 
> component files should exist yet. The other point in time this method is 
> called is on remove, initially to just construct a LogRecord to locate the 
> initial ADD LogRecord (and the remove construction does another directory 
> listing independently).
> The file listing is used in a couple of ways for LogRecord construction, both 
> to get the count of components we expect, as well also the last updated 
> timestamp of said components.
> For ADD records we hardcode the last updated timestamp to 0 and the component 
> count is derived via effectively max(expected component count, any files 
> found on disk with prefix), see:
> {code:java}
>         return make(type, getExistingFiles(absoluteTablePath), 
> table.getAllFilePaths().size(), absoluteTablePath);
> {code}
> and then eventually the number of files found on disk via positive timestamp:
> {code:java}
>  LogRecord(type, absolutePath, lastModified, Math.max(minFiles, 
> positiveTSCount));
> {code}
> I'm interested in seeing if we can instead simply always using the expected 
> component count on flush (minFiles), this leads to deterministic behavior in 
> both call sites.
> We likely need to reason carefully about this change given potential 
> correctness concerns.
> However I'm optimistic that it may be feasible, since for one we already 
> assert right before creating the ADD record for flushing, that there aren't 
> already some components on disk:
> {code:java}
>         Set<Component> existingComponents = Sets.filter(components, c -> 
> descriptor.fileFor(c).exists());
>         assert existingComponents.isEmpty() : String.format("Cannot create a 
> new SSTable in directory %s as component files %s already exist there",
>                                                             
> descriptor.directory,
>                                                             
> existingComponents);
>         txn.trackNew(this);
> {code}
>  
> There are maybe some strange edge-cases to reason about where a bug or 
> operator error  results in placing random components that are somehow 
> disjoint or misnamed in the directory prior to the ADD record being created, 
> between it's creation and the search for it in the REMOVE record and 
> thereafter. 
> A lot of startup and recovery mechanism either ignore ADD files, or only make 
> use of their absolute path.
> If anyone has any inklings of problems which may arise or other approaches 
> please let me know.
> I suppose one compromise solution is to fall-back on potentially rather 
> listing the entire directory, just searching explicitly for any of the 
> possible components which could exist via e.g a handful of explicit stat sys 
> calls.
> I feel like this would be quite a lot more complicated for potentially little 
> to no benefit but we could opt to do that instead I we find some concrete 
> problems with just taking the component count.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (CASSANDRA-21345) avoid listing files in LogRecord when creating a new SSTableWriter for flush

Reply via email to