[PR] [SPARK-53991][SQL] Add SQL support for KLL quantiles functions based on DataSketches [spark]

via GitHub Thu, 30 Oct 2025 12:48:34 -0700


dtenedor opened a new pull request, #52800:
URL: https://github.com/apache/spark/pull/52800


   ### What changes were proposed in this pull request?
   
   This PR adds support for KLL quantile sketches to Spark SQL, based on the 
Apache DataSketches KLL library. KLL is named after the algorithm's authors 
(Karnin, Lang, and Liberty). KLL sketches provide a compact, approximate 
representation of data distributions, enabling efficient quantile estimation 
and rank queries on large datasets with bounded memory usage and strong 
accuracy guarantees.
   
   It introduces 15 new SQL functions organized into five categories:
   
   1. Aggregation Functions
   * `kll_sketch_agg_bigint(col)` - Creates a KLL sketch from 
BIGINT/TINYINT/SMALLINT/INT values
   * `kll_sketch_agg_float(col)` - Creates a KLL sketch from FLOAT values
   * `kll_sketch_agg_double(col)` - Creates a KLL sketch from DOUBLE values
   
   2. Sketch Inspection Functions
   * `kll_sketch_to_string_bigint(sketch)` - Returns a human-readable string 
representation
   * `kll_sketch_to_string_float(sketch)` - Returns a human-readable string 
representation
   * `kll_sketch_to_string_double(sketch)` - Returns a human-readable string 
representation
   
   3. Sketch Merging Functions
   * `kll_sketch_merge_bigint(sketch1, sketch2)` - Merges two compatible 
sketches
   * `kll_sketch_merge_float(sketch1, sketch2)` - Merges two compatible 
sketches
   * `kll_sketch_merge_double(sketch1, sketch2)` - Merges two compatible 
sketches
   
   4. Quantile Estimation Functions
   * `kll_sketch_get_quantile_bigint(sketch, rank)` - Estimates the value at a 
given rank (0.0-1.0)
   * `kll_sketch_get_quantile_float(sketch, rank)` - Estimates the value at a 
given rank
   * `kll_sketch_get_quantile_double(sketch, rank)` - Estimates the value at a 
given rank
   
   These functions support both single rank values and arrays of ranks for 
batch quantile queries.
   
   5. Rank Estimation Functions
   * `kll_sketch_get_rank_bigint(sketch, value)` - Estimates the rank (0.0-1.0) 
of a given value
   * `kll_sketch_get_rank_float(sketch, value)` - Estimates the rank of a given 
value
   * `kll_sketch_get_rank_double(sketch, value)` - Estimates the rank of a 
given value
   
   These functions support both single values and arrays of values for batch 
rank queries.
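
To make the rank and quantile semantics concrete, here is a minimal pure-Python sketch of the exact quantities that the quantile and rank functions approximate. It is not the PR's implementation: the helper names `exact_rank` and `exact_quantile` are hypothetical, and the inclusive rank convention is an assumption, since this description does not state which convention the SQL functions follow.

```python
from bisect import bisect_right
from math import ceil


def exact_rank(sorted_values, value):
    """Fraction of items <= value (inclusive normalized rank in [0.0, 1.0])."""
    return bisect_right(sorted_values, value) / len(sorted_values)


def exact_quantile(sorted_values, rank):
    """Smallest item whose inclusive normalized rank is >= rank."""
    if not 0.0 <= rank <= 1.0:
        raise ValueError(f"rank must be within [0.0, 1.0], got {rank}")
    index = max(ceil(rank * len(sorted_values)) - 1, 0)
    return sorted_values[index]


values = sorted(range(1, 101))  # the integers 1..100

# Single-argument form: one rank in, one value out (and vice versa).
median = exact_quantile(values, 0.5)
rank_of_median = exact_rank(values, median)

# Array form: a batch of ranks or values is answered element by element.
quartiles = [exact_quantile(values, r) for r in (0.25, 0.5, 0.75)]
ranks = [exact_rank(values, v) for v in (10, 50, 90)]
```

Under this convention rank and quantile are near-inverses of each other: asking for the quantile at rank 0.5 and then for the rank of that value returns 0.5 again. A sketch answers the same questions approximately, from a much smaller summary.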
   
   This PR only includes SQL language support; DataFrame API support will 
follow in a separate PR.
   
   Key Features:
   * Type Safety: Separate implementations for BIGINT (covering 
TINYINT/SMALLINT/INT), FLOAT, and DOUBLE types ensure type-safe operations
   * Array Support: Quantile and rank functions accept arrays for efficient 
batch operations
   * Memory Efficient: Sketches are serialized to BINARY type for compact 
storage and efficient shuffling
   * NULL Handling: All aggregate functions properly ignore NULL input values, 
consistent with standard SQL aggregate behavior
   * Error Handling: Comprehensive validation with structured error messages 
for out-of-range quantile ranks (must be within 0.0-1.0), incompatible sketch 
merges, invalid binary sketch data, and type mismatches
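
The NULL handling and rank validation described in this list can be pictured with a small Python sketch. Everything here is illustrative: `non_null` and `validate_ranks` are hypothetical helpers, and `ValueError` is only a stand-in, since the actual error classes and messages added by the PR are not shown in this description.

```python
def non_null(values):
    """Drop None entries, mirroring how SQL aggregates skip NULL inputs."""
    return [v for v in values if v is not None]


def validate_ranks(ranks):
    """Accept one rank or a list of ranks; reject anything outside [0.0, 1.0]."""
    batch = ranks if isinstance(ranks, (list, tuple)) else [ranks]
    for rank in batch:
        if not 0.0 <= rank <= 1.0:
            raise ValueError(f"rank must be within [0.0, 1.0], got {rank}")
    return list(batch)


column = [3, None, 1, None, 2]
kept = non_null(column)  # only the non-NULL values would reach the sketch

try:
    validate_ranks([0.5, 1.5])  # 1.5 is out of range
    error = None
except ValueError as exc:
    error = str(exc)
```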
   
   ### Why are the changes needed?
   
   KLL sketches enable approximate quantile and rank queries on large datasets 
with:
   * O(1) space complexity - Bounded memory usage regardless of data size
   * High accuracy - Configurable error bounds with proven theoretical 
guarantees
   * Fast queries - O(log n) query time for quantile/rank estimation
   * Mergeable - Sketches can be combined for distributed aggregation
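
The bounded-size and mergeability properties can be illustrated with a toy compactor-based summary in Python. This is a deliberately simplified sketch in the spirit of KLL, not the DataSketches algorithm: the class name, the fixed per-level capacity `k`, and the deterministic alternating offset are all assumptions made for illustration (real KLL uses randomized compaction and level capacities that shrink geometrically).

```python
class ToyQuantileSketch:
    """Simplified compactor-based sketch; an item at level h has weight 2**h."""

    def __init__(self, k=200):
        self.k = k            # per-level capacity (hypothetical parameter)
        self.levels = [[]]    # levels[h] holds items of weight 2**h
        self.n = 0            # number of values summarized so far
        self._offset = 0      # alternates between 0 and 1 on each compaction

    def update(self, value):
        self.levels[0].append(value)
        self.n += 1
        self._compress()

    def merge(self, other):
        """Fold another sketch into this one, level by level."""
        while len(self.levels) < len(other.levels):
            self.levels.append([])
        for h, items in enumerate(other.levels):
            self.levels[h].extend(items)
        self.n += other.n
        self._compress()

    def _compress(self):
        h = 0
        while h < len(self.levels):
            if len(self.levels[h]) >= self.k:
                items = sorted(self.levels[h])
                # Hold back one item if the count is odd so weight is preserved.
                held = [items.pop()] if len(items) % 2 else []
                if h + 1 == len(self.levels):
                    self.levels.append([])
                # Keep every other item; survivors move up and double in weight.
                self.levels[h + 1].extend(items[self._offset::2])
                self._offset ^= 1
                self.levels[h] = held
            h += 1

    def rank(self, value):
        """Estimated fraction of summarized values that are <= value."""
        weight = sum(
            (1 << h) * sum(1 for item in items if item <= value)
            for h, items in enumerate(self.levels)
        )
        return weight / self.n

    def retained(self):
        """Number of items physically stored across all levels."""
        return sum(len(items) for items in self.levels)


# 10,000 distinct values in a scrambled but deterministic order.
data = [(i * 7919) % 10007 for i in range(10_000)]

# Sketch each "partition" separately, then merge, as a distributed job would.
left, right = ToyQuantileSketch(), ToyQuantileSketch()
for v in data[:5_000]:
    left.update(v)
for v in data[5_000:]:
    right.update(v)
left.merge(right)

probe = sorted(data)[len(data) // 2 - 1]          # the exact median
true_rank = sum(1 for v in data if v <= probe) / len(data)
estimated_rank = left.rank(probe)
```

Each compaction halves a full level and doubles the weight of the survivors, so the total weight always equals the number of values seen while the merged sketch stores far fewer items than it summarizes. The estimated rank of the median stays close to the true rank even though the two halves were sketched independently.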
   
   Use cases include:
   * Approximate median/percentile calculations on massive datasets
   * Distribution analysis for monitoring and analytics
   * SLA compliance checking (e.g., p95, p99 latency)
   * Efficient histogram generation
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this PR introduces 15 new SQL functions available in Spark SQL.
   
   ### How was this patch tested?
   
   SQL Golden File Tests: Added `kllquantiles.sql` with 378 lines of test 
queries covering:
   * All three data types (BIGINT, FLOAT, DOUBLE)
   * Multiple input sizes (empty, single value, multiple values)
   * NULL value handling (verified NULLs are ignored)
   * Quantile estimation (single and array inputs)
   * Rank estimation (single and array inputs)
   * Sketch merging
   * Approximate result validation using tolerance-based comparisons
   * Negative tests for error conditions (invalid quantiles, type mismatches, 
incompatible merges)
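
Because sketch results are approximate, such tests compare against the exact answer within a tolerance rather than for equality. A minimal Python sketch of that idea follows; the helper name and the numbers are hypothetical and are not taken from `kllquantiles.sql`.

```python
def within_tolerance(estimate, expected, tolerance):
    """True when an approximate result is within an absolute tolerance."""
    return abs(estimate - expected) <= tolerance


# Hypothetical numbers: an estimated median checked against the exact one.
exact_median = 50.0
estimated_median = 50.4
ok = within_tolerance(estimated_median, exact_median, tolerance=1.0)
too_strict = within_tolerance(estimated_median, exact_median, tolerance=0.1)
```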
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, code assistance with `claude-4.5-sonnet` in combination with manual 
editing by the author.
   


