dtenedor opened a new pull request, #56438:
URL: https://github.com/apache/spark/pull/56438
## What changes were proposed in this pull request?
[DO NOT SUBMIT] This is currently a prototype for discussion purposes only.
More background:
This PR adds an explicit, versioned provenance envelope to the binary representation produced by
Apache DataSketches-backed approximate sketch functions, so that incompatible sketches can be
detected at union/intersection/merge time instead of silently returning wrong answers.
Core pieces:
- **New `SketchEnvelope` codec** (`sql/catalyst/.../util/SketchEnvelope.scala`): a small,
  self-identifying 28-byte little-endian header wrapped around the native sketch payload.
  It records a `SketchProfile`: sketch kind, key encoding, engine origin, collation id,
  ICU major/minor, DataSketches lib version, and a manually-bumped Spark collation-factory
  revision. A fixed magic (`0xDB 0x53 0x4B 0x01`, where byte index 2 = `0x4B` can never be a
  valid DataSketches `FamilyId`) plus a payload-length cross-check makes detection unambiguous.
- **Backward/forward compatibility:** envelope writes are gated behind a new config and default
  to off. On read, every decode path strips the envelope if present and passes legacy
  (un-enveloped) buffers through unchanged, so already-materialized sketches keep working with
  no rewrite. Centralized unwrapping was added at the shared decode points
  (`TupleSketchUtils.heapifySketch`, `ThetaSketchUtils.wrapCompactSketch`) and at the KLL decode
  sites, making all read paths envelope-tolerant.
- **Write-side + detection wiring** across HLL, Theta, ApproxTopK, Tuple
  (`tuple_sketch_agg_*`, `tuple_union/intersection/difference[_theta]_*`), and KLL
  (`kll_sketch_agg_*`, `kll_merge_agg_*`, scalar `kll_*`) aggregates and scalar functions:
  output is wrapped with the current runtime profile (when enabled), inputs are checked against
  the first observed profile at merge points, and provenance is propagated through set ops.
- **New `sketch_metadata` SQL function** that decodes a sketch buffer's provenance into a
  `STRUCT<kind, key_encoding, collation_id, icu_version, datasketches_version, engine,
  has_envelope>`. Legacy buffers report `has_envelope = false`.
- **Configs** (`SQLConf`): `spark.sql.sketch.envelope.writeEnabled` (default `false`) and
  `spark.sql.sketch.allowVersionMismatch` (default `false`).
- **Error conditions:** `SKETCH_KEY_ENCODING_MISMATCH`, `SKETCH_COLLATION_MISMATCH`,
  `SKETCH_ICU_VERSION_MISMATCH` (with `QueryExecutionErrors` factories).
This PR also fixes a latent header-size bug uncovered by the new unit tests: the header is
actually 28 bytes (not 24), and the `payload_length` field lives at offset 24 (not 20). The
prior constants would have caused `wrap` to throw `BufferOverflowException` and `hasEnvelope`
to read the wrong offset the moment the feature was enabled.
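
To make the layout concrete, here is a minimal sketch of how such a header could be written and detected. It is not the PR's `SketchEnvelope` code: the object name `EnvelopeSketch` is hypothetical, and the 20 bytes between the magic and `payload_length` are treated as an opaque profile block because the field-by-field layout is not spelled out in this description. Only the 28-byte size, the magic bytes, the little-endian order, and the `payload_length` offset of 24 come from the text above.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Illustrative stand-in, NOT the PR's SketchEnvelope implementation.
object EnvelopeSketch {
  val HeaderSize = 28            // from the description: header is 28 bytes
  val PayloadLengthOffset = 24   // from the description: payload_length at offset 24
  val Magic: Array[Byte] = Array(0xDB, 0x53, 0x4B, 0x01).map(_.toByte)

  /** Prepends the header. `profile` is an opaque 20-byte block (assumed layout). */
  def wrap(profile: Array[Byte], payload: Array[Byte]): Array[Byte] = {
    require(profile.length == PayloadLengthOffset - Magic.length, "profile must be 20 bytes")
    val buf = ByteBuffer.allocate(HeaderSize + payload.length).order(ByteOrder.LITTLE_ENDIAN)
    buf.put(Magic).put(profile).putInt(payload.length).put(payload)
    buf.array()
  }

  /** True only if the magic matches AND the declared length equals the actual remainder. */
  def hasEnvelope(bytes: Array[Byte]): Boolean = {
    if (bytes.length < HeaderSize || !bytes.take(Magic.length).sameElements(Magic)) {
      false
    } else {
      val declared =
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt(PayloadLengthOffset)
      declared == bytes.length - HeaderSize
    }
  }

  /** Strips the header when present; legacy (un-enveloped) buffers pass through unchanged. */
  def unwrap(bytes: Array[Byte]): Array[Byte] =
    if (hasEnvelope(bytes)) bytes.drop(HeaderSize) else bytes
}
```

In this sketch, allocating `24 + payload.length` bytes while still writing 28 header bytes would overflow the buffer, which is the kind of failure the size fix above addresses.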
## Why are the changes needed?
The serialized bytes of a DataSketches sketch capture the native preamble but not the
Spark-side provenance that affects correctness when sketches are combined, most importantly
the string key encoding and the ICU/collation version used to hash string keys. After an ICU
upgrade or a change to `CollationFactory` semantics, unioning/merging an old materialized
sketch with a freshly built one can silently produce incorrect distinct-count/quantile results.
Recording explicit provenance lets Spark detect these incompatibilities and fail loudly (or
warn), rather than returning wrong answers.
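
The merge-time check can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the PR's policy code: `Profile`, `Verdict`, `check`, and the encoding names `"UTF8"`/`"NUMERIC"` are hypothetical, the numeric-encoding exemption is assumed to skip the collation and ICU comparisons, and `allowVersionMismatch` is assumed to downgrade any hard error to a warning. Only the three error-condition names and the existence of a soft collation-revision warning come from this description.

```scala
object CompatibilitySketch {
  // Hypothetical, trimmed-down stand-in for the PR's SketchProfile.
  final case class Profile(
      keyEncoding: String,
      collationId: Int,
      icuMajor: Int,
      icuMinor: Int,
      collationRevision: Int)

  sealed trait Verdict
  case object Compatible extends Verdict
  final case class Warn(reason: String) extends Verdict
  final case class Fail(condition: String) extends Verdict

  /** Compares an incoming profile against the first profile observed at a merge point. */
  def check(first: Profile, next: Profile, allowVersionMismatch: Boolean): Verdict = {
    val strict: Verdict =
      if (first.keyEncoding != next.keyEncoding) Fail("SKETCH_KEY_ENCODING_MISMATCH")
      // Assumption: numeric keys are hashed independently of collation and ICU.
      else if (first.keyEncoding == "NUMERIC") Compatible
      else if (first.collationId != next.collationId) Fail("SKETCH_COLLATION_MISMATCH")
      else if (first.icuMajor != next.icuMajor || first.icuMinor != next.icuMinor)
        Fail("SKETCH_ICU_VERSION_MISMATCH")
      else if (first.collationRevision != next.collationRevision)
        Warn("collation-factory revision differs")
      else Compatible

    // Assumption: the config turns hard errors into warnings rather than hiding them.
    strict match {
      case Fail(condition) if allowVersionMismatch => Warn(s"suppressed $condition")
      case other => other
    }
  }
}
```

Returning a verdict value instead of throwing keeps the sketch easy to test; the PR itself raises the errors through `QueryExecutionErrors` factories.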
## Does this PR introduce any user-facing change?
Yes, but it is opt-in and backward compatible:
- Two new configs (`spark.sql.sketch.envelope.writeEnabled`,
  `spark.sql.sketch.allowVersionMismatch`), both defaulting to `false`, which preserves the
  pre-existing behavior (envelope writing off).
- A new `sketch_metadata` SQL function.
- New `SKETCH_*` error conditions, raised only when an incompatible enveloped sketch is combined
  and `allowVersionMismatch` is false.
- With writes disabled (the default), existing materialized sketches and query results are
  unchanged.
## How was this patch tested?
Built and tested with sbt:
- New `SketchEnvelopeSuite` (18 tests): wrap/unwrap round-tripping across all sketch kinds ×
  key encodings (incl. empty payloads), legacy/short/magic-collision buffer handling, the full
  compatibility policy (key-encoding/collation/ICU hard errors, soft collation-revision warning,
  numeric-encoding exemption, `allowMismatch` suppression), and
  `currentProfile`/`currentItemsProfile`/accessor behavior.
- Catalyst sketch suites (94/94 passed): `ApproxTopKSuite`, `DatasketchesHllSketchSuite`,
  `ThetasketchesAggSuite`, `ThetaSketchUtilsSuite`, `TupleSketchUtilsSuite`,
  `SketchEnvelopeSuite`.
- End-to-end SQL golden-file suites passed via `SQLQueryTestSuite`: `hll.sql`,
  `thetasketch.sql`, `tuplesketch.sql`, `kllquantiles.sql`.
- Regenerated the `sql-expression-schema.md` golden file for the new `sketch_metadata` function
  via `ExpressionsSchemaSuite` with `SPARK_GENERATE_GOLDEN_FILES=1`.
## Was this patch authored or co-authored using generative AI tooling?
Generated-by: `claude-opus-4-8-thinking-high`