dtenedor opened a new pull request, #56438:
URL: https://github.com/apache/spark/pull/56438

   ## What changes were proposed in this pull request?
   
   [DO NOT SUBMIT] This is currently a prototype for discussion purposes only.
   
   More background:
   
   This PR adds an explicit, versioned provenance envelope to the binary representation
   produced by Apache DataSketches-backed approximate sketch functions, so that
   incompatible sketches can be detected at union/intersection/merge time instead of
   silently returning wrong answers.

   Core pieces:
   - **New `SketchEnvelope` codec** (`sql/catalyst/.../util/SketchEnvelope.scala`): a small,
     self-identifying 28-byte little-endian header wrapped around the native sketch payload.
     It records a `SketchProfile`: sketch kind, key encoding, engine origin, collation id,
     ICU major/minor, DataSketches lib version, and a manually-bumped Spark collation-factory
     revision. A fixed magic (`0xDB 0x53 0x4B 0x01`, where byte index 2 = `0x4B` can never be a
     valid DataSketches `FamilyId`) plus a payload-length cross-check makes detection unambiguous.
   - **Backward/forward compatibility:** envelope writes are gated behind a new config and default
     to off. On read, every decode path strips the envelope if present and passes legacy
     (un-enveloped) buffers through unchanged, so already-materialized sketches keep working with
     no rewrite. Centralized unwrapping was added at the shared decode points
     (`TupleSketchUtils.heapifySketch`, `ThetaSketchUtils.wrapCompactSketch`) and at the KLL decode
     sites, making all read paths envelope-tolerant.
   - **Write-side + detection wiring** across HLL, Theta, ApproxTopK, Tuple
     (`tuple_sketch_agg_*`, `tuple_union/intersection/difference[_theta]_*`), and KLL
     (`kll_sketch_agg_*`, `kll_merge_agg_*`, scalar `kll_*`) aggregates and scalar functions:
     output is wrapped with the current runtime profile (when enabled), inputs are checked against
     the first observed profile at merge points, and provenance is propagated through set ops.
   - **New `sketch_metadata` SQL function** that decodes a sketch buffer's provenance into a
     `STRUCT<kind, key_encoding, collation_id, icu_version, datasketches_version, engine,
     has_envelope>`. Legacy buffers report `has_envelope = false`.
   - **Configs** (`SQLConf`): `spark.sql.sketch.envelope.writeEnabled` (default `false`) and
     `spark.sql.sketch.allowVersionMismatch` (default `false`).
   - **Error conditions:** `SKETCH_KEY_ENCODING_MISMATCH`, `SKETCH_COLLATION_MISMATCH`,
     `SKETCH_ICU_VERSION_MISMATCH` (with `QueryExecutionErrors` factories).

   This PR also fixes a latent header-size bug uncovered by the new unit tests: the header is
   actually 28 bytes (not 24), and the `payload_length` field lives at offset 24 (not 20). The
   prior constants would have caused `wrap` to throw `BufferOverflowException` and `hasEnvelope`
   to read the wrong offset the moment the feature was enabled.
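
For discussion, the header arithmetic and the legacy pass-through can be sketched roughly as below. This is not the code in the PR. Only the magic bytes, the 28-byte header size, and the payload-length offset of 24 come from the description above. The object name `EnvelopeSketch`, the opaque 20-byte `profileBytes` block standing in for the `SketchProfile` fields, and the 4-byte width of the length field are assumptions made for illustration.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Illustrative sketch only. Names and the profile-field layout are hypothetical;
// the magic, header size (28), and payload-length offset (24) follow the PR text.
object EnvelopeSketch {
  val Magic: Array[Byte] = Array(0xDB, 0x53, 0x4B, 0x01).map(_.toByte)
  val HeaderSize = 28
  val PayloadLengthOffset = 24

  /** Prefix `payload` with a header. `profileBytes` is a placeholder for bytes 4..23. */
  def wrap(profileBytes: Array[Byte], payload: Array[Byte]): Array[Byte] = {
    require(profileBytes.length == PayloadLengthOffset - Magic.length, "expected 20 profile bytes")
    // Allocating only 24 header bytes here would overflow on the final put(payload).
    val buf = ByteBuffer.allocate(HeaderSize + payload.length).order(ByteOrder.LITTLE_ENDIAN)
    buf.put(Magic).put(profileBytes).putInt(payload.length).put(payload)
    buf.array()
  }

  /** True only if the magic matches AND the declared payload length matches the buffer size. */
  def hasEnvelope(bytes: Array[Byte]): Boolean = {
    if (bytes.length < HeaderSize) {
      false
    } else if (!java.util.Arrays.equals(bytes.take(Magic.length), Magic)) {
      false
    } else {
      val declared = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt(PayloadLengthOffset)
      declared == bytes.length - HeaderSize
    }
  }

  /** Strip the header if present; legacy (un-enveloped) buffers are returned unchanged. */
  def unwrap(bytes: Array[Byte]): Array[Byte] =
    if (hasEnvelope(bytes)) bytes.drop(HeaderSize) else bytes
}
```

With these constants, 4 magic bytes plus 20 profile bytes plus a 4-byte length add up to 28, which is why a 24-byte header constant cannot hold the length field at offset 24.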
   
   ## Why are the changes needed?
   
   The serialized bytes of a DataSketches sketch capture the native preamble but not the
   Spark-side provenance that affects correctness when sketches are combined, most importantly
   the string key encoding and the ICU/collation version used to hash string keys. After an ICU
   upgrade or a change to `CollationFactory` semantics, unioning/merging an old materialized
   sketch with a freshly built one can silently produce incorrect distinct-count/quantile results.
   Recording explicit provenance lets Spark detect these incompatibilities and fail loudly (or
   warn), rather than returning wrong answers.

   ## Does this PR introduce any user-facing change?

   Yes, but it is opt-in and backward compatible:
   - Two new configs (`spark.sql.sketch.envelope.writeEnabled`,
     `spark.sql.sketch.allowVersionMismatch`), both defaulting to `false`, so envelope writing
     stays off and the pre-existing behavior is preserved.
   - A new `sketch_metadata` SQL function.
   - New `SKETCH_*` error conditions raised only when an incompatible enveloped sketch is combined
     and `allowVersionMismatch` is false.
   - With writes disabled (the default), existing materialized sketches and query results are
     unchanged.
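
The conditions under which the new errors fire can be summarized with a rough sketch like the one below. It is an assumption-laden illustration, not the PR's implementation. Only the three error-condition names, the "first observed profile" rule, the legacy pass-through, and the `allowVersionMismatch` suppression come from the description. The `Profile` fields, the `CompatibilitySketch` object, the order of the checks, and the string return type are hypothetical. The soft collation-revision warning and the numeric-encoding exemption mentioned in the test notes are left out.

```scala
// Hypothetical model of a decoded provenance profile; field names are illustrative.
final case class Profile(keyEncoding: String, collationId: Int, icuMajor: Int, icuMinor: Int)

object CompatibilitySketch {
  /** Returns the error condition that would be raised, or None if the merge may proceed. */
  def check(first: Profile, next: Profile, allowVersionMismatch: Boolean): Option[String] = {
    if (allowVersionMismatch) {
      None // the config suppresses all SKETCH_* errors
    } else if (first.keyEncoding != next.keyEncoding) {
      Some("SKETCH_KEY_ENCODING_MISMATCH")
    } else if (first.collationId != next.collationId) {
      Some("SKETCH_COLLATION_MISMATCH")
    } else if ((first.icuMajor, first.icuMinor) != (next.icuMajor, next.icuMinor)) {
      Some("SKETCH_ICU_VERSION_MISMATCH")
    } else {
      None
    }
  }

  /**
   * Compares every enveloped input against the first enveloped input seen.
   * `None` entries model legacy buffers, which carry no profile and are skipped.
   */
  def firstConflict(inputs: Seq[Option[Profile]], allowVersionMismatch: Boolean): Option[String] = {
    val enveloped = inputs.flatten
    enveloped.headOption.flatMap { first =>
      enveloped.tail.flatMap(p => check(first, p, allowVersionMismatch)).headOption
    }
  }
}
```

In this model a merge over only legacy buffers never reports a conflict, which matches the statement that existing materialized sketches are unaffected while writes are disabled.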
   
   ## How was this patch tested?
   
   Built and tested with sbt:
   - New `SketchEnvelopeSuite` (18 tests): wrap/unwrap round-tripping across all sketch kinds ×
     key encodings (incl. empty payloads), legacy/short/magic-collision buffer handling, the full
     compatibility policy (key-encoding/collation/ICU hard errors, soft collation-revision warning,
     numeric-encoding exemption, `allowMismatch` suppression), and `currentProfile`/
     `currentItemsProfile`/accessor behavior.
   - Catalyst sketch suites (94/94 passed): `ApproxTopKSuite`, `DatasketchesHllSketchSuite`,
     `ThetasketchesAggSuite`, `ThetaSketchUtilsSuite`, `TupleSketchUtilsSuite`, `SketchEnvelopeSuite`.
   - End-to-end SQL golden-file suites passed via `SQLQueryTestSuite`: `hll.sql`, `thetasketch.sql`,
     `tuplesketch.sql`, `kllquantiles.sql`.
   - Regenerated the `sql-expression-schema.md` golden file for the new `sketch_metadata` function
     via `ExpressionsSchemaSuite` with `SPARK_GENERATE_GOLDEN_FILES=1`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: `claude-opus-4-8-thinking-high`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

