[ 
https://issues.apache.org/jira/browse/CASSANDRA-21452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088592#comment-18088592
 ] 

Jon Haddad commented on CASSANDRA-21452:
----------------------------------------

I can definitely break this up.  It's spread over dozens of commits, with a lot 
of them being single fixes.  I think it would be good to get the test harness 
in early, because it exposes a lot of issues and can be useful in detecting 
regressions, but without all the supporting fixes it'll be failing.

Would you rather see the byte for byte comparison work go in early & fail, or 
the individual fixes go in first with the test to follow?

> Complete cursor compaction coverage (complex columns, BTI, wide schemas), 
> verified byte-for-byte against the iterator path
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21452
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21452
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Jon Haddad
>            Assignee: Jon Haddad
>            Priority: Normal
>
> h2. Background
> CASSANDRA-20918 introduced cursor-based compaction ({{CursorCompactor}}, 
> {{SSTableCursorReader}}, {{SSTableCursorWriter}}): a streaming merge that 
> avoids the per-partition/row/cell object materialization of the iterator path 
> ({{CompactionIterator}}). Reported results: 2–5x faster compaction, up to 
> ~100x allocation reduction, with a fixed memory footprint per sstable.
> The motivating problem is CASSANDRA-20428: {{ByteArrayAccessor.read}} — the 
> {{new byte[length]}} per value performed by cell and clustering 
> deserialization — is a dominant allocation source wherever the iterator-based 
> read machinery runs, and compaction pays it for every cell of every sstable 
> it merges. The cursor path eliminates that allocation class entirely, but 
> only for the schemas and configurations it supports; everything else silently 
> falls back to the iterator path at {{AbstractCompactionPipeline.create}}.
> h2. What was missing
> As merged, {{CursorCompactor.isSupported()}} / {{unsupportedMetadata()}} 
> reject most real-world schemas and configurations, so the allocation and 
> throughput wins do not apply where they matter most:
> # *Multi-cell columns* — non-frozen collections and UDTs are unsupported. Any 
> table with a {{map}}, {{set}}, {{list}}, or non-frozen UDT column falls back.
> # *BTI output format* — the cursor only writes the BIG format. Clusters on 
> the trie format (the trunk default direction) never use the cursor path.
> # *Partial-range scanners* — token-subrange compactions fall back.
> # *Counters* — counter tables fall back.
> # *Garbage-skipping modes* — {{tombstoneOption != NONE}} falls back (default 
> is NONE, so this is lower priority).
> Out of scope, by design: secondary indexes / SAI (index observers consume 
> materialized rows and need their own design) and pre-current sstable versions 
> (compaction rewrites to the current version, so the gap phases out on its 
> own).
> Equally missing was *verification*: the existing cursor tests asserted each 
> path's output against expected CQL results independently, but never compared 
> the two paths against each other, and never at the file level. Two compaction 
> implementations that must be interchangeable at scale need a stronger bar 
> than "both look right in isolation" — divergence between cursor- and 
> iterator-compacted replicas is silent and permanent.
> h2. What needs to be implemented
> Close the support gaps incrementally — complex columns, then BTI output, then 
> partial-range scanners, then counters, with the garbage-skipper equivalent 
> optional — with each increment landing only when it passes the verification 
> bar below. Wide schemas (>64-column subset encodings), materialized-view 
> tables (strict liveness), and multi-gigabyte partitions are part of the 
> supported surface and must be covered, not special-cased away.
> Two standing constraints:
> * *Garbage-free is the point of the feature.* Every change must preserve the 
> no-per-element-allocation property of the hot path, enforced by measurement 
> rather than code review alone.
> * *Behavior must be identical, not merely equivalent.* Purge decisions, 
> tie-breaks, stats, index structure — everything.
> h2. Required: byte-for-byte differential verification
> The two compaction implementations must be interchangeable, and the way to 
> prove that is a test harness that compacts the same inputs through both 
> production pipelines and compares the outputs *byte for byte* — every 
> component of every output sstable, in both sstable formats. Any component 
> permitted to diverge needs an explicit, justified exception; the expectation 
> is that none are needed. (Experience so far supports the bar: byte 
> divergences found this way have been bugs, not justified differences.)
> Coverage should span the supported surface — the tombstone, TTL, and 
> timestamp edge cases where merge implementations classically disagree, 
> randomized schemas and workloads, and large-scale shapes (wide schemas, 
> many-sstable merges, multi-gigabyte partitions) — and should grow with each 
> increment so a closed gap can never silently fall back or regress. The 
> harness design beyond that is an implementation detail of the patch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to