[ 
https://issues.apache.org/jira/browse/CASSANDRA-21452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon Haddad updated CASSANDRA-21452:
-----------------------------------
    Issue Type: Epic  (was: Improvement)

> Complete cursor compaction coverage (complex columns, BTI, wide schemas), 
> verified byte-for-byte against the iterator path
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21452
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21452
>             Project: Apache Cassandra
>          Issue Type: Epic
>          Components: Local/Compaction
>            Reporter: Jon Haddad
>            Assignee: Jon Haddad
>            Priority: Normal
>
> h2. Background
> CASSANDRA-20918 introduced cursor-based compaction ({{CursorCompactor}}, 
> {{SSTableCursorReader}}, {{SSTableCursorWriter}}): a streaming merge that 
> avoids the per-partition/row/cell object materialization of the iterator path 
> ({{CompactionIterator}}). Reported results: 2–5x faster compaction, up to 
> ~100x allocation reduction, with a fixed memory footprint per sstable.
> The motivating problem is CASSANDRA-20428: {{ByteArrayAccessor.read}} — the 
> {{new byte[length]}} per value performed by cell and clustering 
> deserialization — is a dominant allocation source wherever the iterator-based 
> read machinery runs, and compaction pays it for every cell of every sstable 
> it merges. The cursor path eliminates that allocation class entirely, but 
> only for the schemas and configurations it supports; everything else silently 
> falls back to the iterator path at {{AbstractCompactionPipeline.create}}.
> h2. What was missing
> As merged, {{CursorCompactor.isSupported()}} / {{unsupportedMetadata()}} 
> reject most real-world schemas and configurations, so the allocation and 
> throughput wins do not apply where they matter most:
> # *Multi-cell columns* — non-frozen collections and UDTs are unsupported. Any 
> table with a {{map}}, {{set}}, {{list}}, or non-frozen UDT column falls back.
> # *BTI output format* — the cursor only writes the BIG format. Clusters on 
> the trie format (the trunk default direction) never use the cursor path.
> # *Partial-range scanners* — token-subrange compactions fall back.
> # *Counters* — counter tables fall back.
> # *Garbage-skipping modes* — {{tombstoneOption != NONE}} falls back (default 
> is NONE, so this is lower priority).
> Out of scope, by design: secondary indexes / SAI (index observers consume 
> materialized rows and need their own design) and pre-current sstable versions 
> (compaction rewrites to the current version, so the gap phases out on its 
> own).
> Equally missing was *verification*: the existing cursor tests asserted each 
> path's output against expected CQL results independently, but never compared 
> the two paths against each other, and never at the file level. Two compaction 
> implementations that must be interchangeable at scale need a stronger bar 
> than "both look right in isolation" — divergence between cursor- and 
> iterator-compacted replicas is silent and permanent.
> h2. What needs to be implemented
> Close the support gaps incrementally — complex columns, then BTI output, then 
> partial-range scanners, then counters, with the garbage-skipper equivalent 
> optional — with each increment landing only when it passes the verification 
> bar below. Wide schemas (>64-column subset encodings), materialized-view 
> tables (strict liveness), and multi-gigabyte partitions are part of the 
> supported surface and must be covered, not special-cased away.
> Two standing constraints:
> * *Garbage-free is the point of the feature.* Every change must preserve the 
> no-per-element-allocation property of the hot path, enforced by measurement 
> rather than code review alone.
> * *Behavior must be identical, not merely equivalent.* Purge decisions, 
> tie-breaks, stats, index structure — everything.
> h2. Required: byte-for-byte differential verification
> The two compaction implementations must be interchangeable, and the way to 
> prove that is a test harness that compacts the same inputs through both 
> production pipelines and compares the outputs *byte for byte* — every 
> component of every output sstable, in both sstable formats. Any component 
> permitted to diverge needs an explicit, justified exception; the expectation 
> is that none are needed. (Experience so far supports the bar: byte 
> divergences found this way have been bugs, not justified differences.)
> Coverage should span the supported surface — the tombstone, TTL, and 
> timestamp edge cases where merge implementations classically disagree, 
> randomized schemas and workloads, and large-scale shapes (wide schemas, 
> many-sstable merges, multi-gigabyte partitions) — and should grow with each 
> increment so a closed gap can never silently fall back or regress. The 
> harness design beyond that is an implementation detail of the patch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to