dtenedor opened a new pull request, #53297:
URL: https://github.com/apache/spark/pull/53297

   ### What changes were proposed in this pull request?
   
   This PR adds comprehensive documentation for Spark SQL's sketch-based 
approximate functions powered by the Apache DataSketches library. The new 
documentation page (`sql-ref-sketch-aggregates.md`) covers:
   
   **Function Reference:**
   - **HyperLogLog (HLL) Sketch Functions**: `hll_sketch_agg`, `hll_union_agg`, 
`hll_sketch_estimate`, `hll_union` (see the example after this list)
   - **Theta Sketch Functions**: `theta_sketch_agg`, `theta_union_agg`, 
`theta_intersection_agg`, `theta_sketch_estimate`, `theta_union`, 
`theta_intersection`, `theta_difference`
   - **KLL Quantile Sketch Functions**: `kll_sketch_agg_*`, 
`kll_sketch_to_string_*`, `kll_sketch_get_n_*`, `kll_sketch_merge_*`, 
`kll_sketch_get_quantile_*`, `kll_sketch_get_rank_*`
   - **Approximate Top-K Functions**: `approx_top_k_accumulate`, 
`approx_top_k_combine`, `approx_top_k_estimate`
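   
   For orientation, a minimal HLL round trip (the `events` table and `user_id` 
   column are placeholders; the two-argument form of `hll_sketch_agg` follows 
   the Spark 3.5+ function documentation):
   
   ```sql
   -- Build a binary HLL sketch over the group, then turn it back into an
   -- approximate distinct count. The optional second argument (lgConfigK,
   -- default 12) trades memory for accuracy.
   SELECT hll_sketch_estimate(hll_sketch_agg(user_id, 12)) AS approx_distinct_users
   FROM events;
   ```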
   
   **Best Practices:**
   - Guidance on choosing between HLL and Theta sketches
   - Accuracy vs. memory trade-offs for each sketch type
   - Tips for storing and reusing sketches (illustrated in the example below)
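   
   A minimal sketch of the store-and-reuse pattern, assuming an illustrative 
   `events` table with `event_date` and `user_id` columns; only `hll_sketch_agg`, 
   `hll_union_agg`, and `hll_sketch_estimate` (documented since Spark 3.5) are used:
   
   ```sql
   -- Persist one HLL sketch per day so distinct-count questions can later be
   -- answered from the sketches alone, without rescanning the raw events.
   CREATE TABLE daily_user_sketches AS
   SELECT event_date, hll_sketch_agg(user_id) AS user_sketch
   FROM events
   GROUP BY event_date;
   
   -- Approximate distinct users over the last 7 days, merging the stored
   -- daily sketches with hll_union_agg.
   SELECT hll_sketch_estimate(hll_union_agg(user_sketch)) AS users_last_7_days
   FROM daily_user_sketches
   WHERE event_date >= date_sub(current_date(), 7);
   ```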
   
   **Common Use Cases and Examples:**
   - Tracking daily unique users with HLL sketches (ETL workflow)
   - Computing percentiles over time with KLL sketches
   - Set operations with Theta sketches (intersection and difference for cohort 
analysis; see the hedged sketch after this list)
   - Finding trending items with Top-K sketches
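   
   To make the cohort-analysis item concrete, here is a hedged sketch combining 
   the Theta functions listed above. The function names come from this PR, but 
   the argument forms shown (a single column for `theta_sketch_agg`, two sketch 
   values for `theta_intersection`) and the `events`/`cohort`/`user_id` names 
   are assumptions for illustration only:
   
   ```sql
   -- Hedged illustration: approximate user overlap between two cohorts.
   WITH cohorts AS (
     SELECT cohort, theta_sketch_agg(user_id) AS user_sketch
     FROM events
     GROUP BY cohort
   )
   SELECT theta_sketch_estimate(
            theta_intersection(a.user_sketch, b.user_sketch)) AS users_in_both
   FROM cohorts a
   JOIN cohorts b
     ON a.cohort = 'mobile' AND b.cohort = 'web';
   ```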
   
   The PR also adds links to this new documentation page from:
   - `sql-ref-functions.md` (under Aggregate-like Functions)
   - `sql-ref.md` (under Functions section)
   - `_data/menu-sql.yaml` (navigation menu)
   
   ### Why are the changes needed?
   
   Spark SQL has added several sketch-based approximate functions using the 
Apache DataSketches library (HLL sketches in 3.5.0, Theta/KLL/Top-K sketches in 
4.1.0), but there was no comprehensive documentation explaining:
   - How to use these functions together in practical ETL workflows
   - How to store sketches and merge them across multiple data batches
   - Best practices for choosing the right sketch type and tuning accuracy 
parameters
   
This documentation fills that gap and helps users apply sketch-based analytics 
effectively in Spark SQL.
   
   ### Does this PR introduce _any_ user-facing change?
   
Yes, this PR adds a new user-facing documentation page and links to it from 
existing pages. No code changes are included.
   
   ### How was this patch tested?
   
   Documentation-only change. The examples were verified against the existing 
function implementations and test cases in the codebase.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
Yes, generated with assistance from `claude-4.5-opus-high` in combination with 
manual editing by the author.

