dtenedor opened a new pull request, #52800: URL: https://github.com/apache/spark/pull/52800
### What changes were proposed in this pull request?

This PR adds support for KLL (Karnin-Lang-Liberty) quantile sketches to Spark SQL, based on the Apache DataSketches KLL library. KLL sketches provide a compact, approximate representation of data distributions, enabling efficient quantile estimation and rank queries on large datasets with bounded memory usage and strong accuracy guarantees.

It introduces 15 new SQL functions organized into five categories:

1. Aggregation Functions
   * `kll_sketch_agg_bigint(col)` - Creates a KLL sketch from BIGINT/TINYINT/SMALLINT/INT values
   * `kll_sketch_agg_float(col)` - Creates a KLL sketch from FLOAT values
   * `kll_sketch_agg_double(col)` - Creates a KLL sketch from DOUBLE values
2. Sketch Inspection Functions
   * `kll_sketch_to_string_bigint(sketch)` - Returns a human-readable string representation
   * `kll_sketch_to_string_float(sketch)` - Returns a human-readable string representation
   * `kll_sketch_to_string_double(sketch)` - Returns a human-readable string representation
3. Sketch Merging Functions
   * `kll_sketch_merge_bigint(sketch1, sketch2)` - Merges two compatible sketches
   * `kll_sketch_merge_float(sketch1, sketch2)` - Merges two compatible sketches
   * `kll_sketch_merge_double(sketch1, sketch2)` - Merges two compatible sketches
4. Quantile Estimation Functions
   * `kll_sketch_get_quantile_bigint(sketch, rank)` - Estimates the value at a given rank (0.0-1.0)
   * `kll_sketch_get_quantile_float(sketch, rank)` - Estimates the value at a given rank
   * `kll_sketch_get_quantile_double(sketch, rank)` - Estimates the value at a given rank

   Supports both single rank values and arrays of ranks for batch quantile queries.
5. Rank Estimation Functions
   * `kll_sketch_get_rank_bigint(sketch, value)` - Estimates the rank (0.0-1.0) of a given value
   * `kll_sketch_get_rank_float(sketch, value)` - Estimates the rank of a given value
   * `kll_sketch_get_rank_double(sketch, value)` - Estimates the rank of a given value

   Supports both single values and arrays of values for batch rank queries.

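To make the rank and quantile semantics of the functions above concrete, here is a minimal Python sketch. It is not the KLL algorithm and not code from this PR: the hypothetical `ExactQuantiles` class keeps every value and answers exactly, whereas a KLL sketch retains only a bounded sample and answers approximately. The inclusive rank convention (fraction of values less than or equal to the query) and the error raised on empty input are assumptions made for illustration; the PR description does not state which conventions the SQL functions follow.

```python
from bisect import bisect_right
from math import ceil


class ExactQuantiles:
    """Exact stand-in for a quantile sketch (hypothetical; stores every value)."""

    def __init__(self, values=()):
        # Drop None, mirroring how the aggregate functions ignore NULL inputs.
        self.values = sorted(v for v in values if v is not None)

    def merge(self, other):
        # For this exact stand-in, merging simply combines the stored values.
        return ExactQuantiles(self.values + other.values)

    def get_rank(self, value):
        # Fraction of stored values <= value (inclusive convention, assumed).
        if not self.values:
            raise ValueError("empty input")  # a choice of this toy only
        return bisect_right(self.values, value) / len(self.values)

    def get_quantile(self, rank):
        if not 0.0 <= rank <= 1.0:
            raise ValueError("rank must be within [0.0, 1.0]")
        if not self.values:
            raise ValueError("empty input")  # a choice of this toy only
        # Smallest stored value whose inclusive rank is >= the requested rank.
        index = max(ceil(rank * len(self.values)) - 1, 0)
        return self.values[index]


# Two "partitions" are summarized separately and then merged,
# as a distributed aggregation would do with real sketches.
left = ExactQuantiles(range(1, 51))
right = ExactQuantiles(range(51, 101))
merged = left.merge(right)

median = merged.get_quantile(0.5)
rank_of_median = merged.get_rank(median)
```

Here `median` is 50 and `rank_of_median` is 0.5, showing that rank and quantile act as inverses of each other. A real KLL sketch would return values close to these within its error bound rather than exact answers.
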
This PR only includes SQL language support; DataFrame API support will follow in a separate PR.

Key Features:

* Type Safety: Separate implementations for BIGINT (covering TINYINT/SMALLINT/INT), FLOAT, and DOUBLE types ensure type-safe operations
* Array Support: Quantile and rank functions accept arrays for efficient batch operations
* Memory Efficient: Sketches are serialized to BINARY type for compact storage and efficient shuffling
* NULL Handling: All aggregate functions ignore NULL input values, consistent with standard SQL aggregate behavior
* Error Handling: Comprehensive validation with structured error messages for invalid quantile ranges (must be within 0.0-1.0), incompatible sketch merges, invalid binary sketch data, and type mismatches

### Why are the changes needed?

KLL sketches enable approximate quantile and rank queries on large datasets with:

* Bounded space - Memory usage is set by the sketch's accuracy parameter and does not grow linearly with data size
* High accuracy - Configurable error bounds with proven theoretical guarantees
* Fast queries - O(log n) query time for quantile/rank estimation
* Mergeable - Sketches can be combined for distributed aggregation

Use cases include:

* Approximate median/percentile calculations on massive datasets
* Distribution analysis for monitoring and analytics
* SLA compliance checking (e.g., p95, p99 latency)
* Efficient histogram generation

### Does this PR introduce _any_ user-facing change?

Yes, this PR introduces 15 new SQL functions available in Spark SQL.

### How was this patch tested?

SQL Golden File Tests: Added `kllquantiles.sql` with 378 lines of test queries covering:

* All three data types (BIGINT, FLOAT, DOUBLE)
* Multiple input sizes (empty, single value, multiple values)
* NULL value handling (verified NULLs are ignored)
* Quantile estimation (single and array inputs)
* Rank estimation (single and array inputs)
* Sketch merging
* Approximate result validation using tolerance-based comparisons
* Negative tests for error conditions (invalid quantiles, type mismatches, incompatible merges)

### Was this patch authored or co-authored using generative AI tooling?

Yes, code assistance with `claude-4.5-sonnet` in combination with manual editing by the author.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
