Jon Haddad created CASSANDRA-21452:
--------------------------------------
Summary: Complete cursor compaction coverage (complex columns,
BTI, wide schemas), verified byte-for-byte against the iterator path
Key: CASSANDRA-21452
URL: https://issues.apache.org/jira/browse/CASSANDRA-21452
Project: Apache Cassandra
Issue Type: Improvement
Components: Local/Compaction
Reporter: Jon Haddad
h2. Background
CASSANDRA-20918 introduced cursor-based compaction ({{CursorCompactor}},
{{SSTableCursorReader}}, {{SSTableCursorWriter}}): a streaming merge that
avoids the per-partition/row/cell object materialization of the iterator path
({{CompactionIterator}}). Reported results: 2–5x faster compaction, up to ~100x
allocation reduction, with a fixed memory footprint per sstable.
The motivating problem is CASSANDRA-20428: {{ByteArrayAccessor.read}} — the
{{new byte[length]}} per value performed by cell and clustering deserialization
— is a dominant allocation source wherever the iterator-based read machinery
runs, and compaction pays it for every cell of every sstable it merges. The
cursor path eliminates that allocation class entirely, but only for the schemas
and configurations it supports; everything else silently falls back to the
iterator path at {{AbstractCompactionPipeline.create}}.
h2. What was missing
As merged, {{CursorCompactor.isSupported()}} / {{unsupportedMetadata()}} reject
most real-world schemas and configurations, so the allocation and throughput
wins do not apply where they matter most:
# *Multi-cell columns* — non-frozen collections and UDTs are unsupported. Any
table with a {{map}}, {{set}}, {{list}}, or non-frozen UDT column falls back.
# *BTI output format* — the cursor only writes the BIG format. Clusters on the
trie format (the trunk default direction) never use the cursor path.
# *Partial-range scanners* — token-subrange compactions fall back.
# *Counters* — counter tables fall back.
# *Garbage-skipping modes* — {{tombstoneOption != NONE}} falls back (default is
NONE, so this is lower priority).
Out of scope, by design: secondary indexes / SAI (index observers consume
materialized rows and need their own design) and pre-current sstable versions
(compaction rewrites to the current version, so the gap phases out on its own).
Equally missing was *verification*: the existing cursor tests asserted each
path's output against expected CQL results independently, but never compared
the two paths against each other, and never at the file level. Two compaction
implementations that must be interchangeable at scale need a stronger bar than
"both look right in isolation" — divergence between cursor- and
iterator-compacted replicas is silent and permanent.
h2. What needs to be implemented
Close the support gaps incrementally — complex columns, then BTI output, then
partial-range scanners, then counters, with the garbage-skipper equivalent
optional — with each increment landing only when it passes the verification bar
below. Wide schemas (>64-column subset encodings), materialized-view tables
(strict liveness), and multi-gigabyte partitions are part of the supported
surface and must be covered, not special-cased away.
Two standing constraints:
* *Garbage-free is the point of the feature.* Every change must preserve the
no-per-element-allocation property of the hot path, enforced by measurement
rather than code review alone.
* *Behavior must be identical, not merely equivalent.* Purge decisions,
tie-breaks, stats, index structure — everything.
h2. Required: byte-for-byte differential verification
The two compaction implementations must be interchangeable, and the way to
prove that is a test harness that compacts the same inputs through both
production pipelines and compares the outputs *byte for byte* — every component
of every output sstable, in both sstable formats. Any component permitted to
diverge needs an explicit, justified exception; the expectation is that none
are needed. (Experience so far supports the bar: byte divergences found this
way have been bugs, not justified differences.)
Coverage should span the supported surface — the tombstone, TTL, and timestamp
edge cases where merge implementations classically disagree, randomized schemas
and workloads, and large-scale shapes (wide schemas, many-sstable merges,
multi-gigabyte partitions) — and should grow with each increment so a closed
gap can never silently fall back or regress. The harness design beyond that is
an implementation detail of the patch.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]