dtenedor opened a new pull request, #53297: URL: https://github.com/apache/spark/pull/53297
### What changes were proposed in this pull request?

This PR adds comprehensive documentation for Spark SQL's sketch-based approximate functions powered by the Apache DataSketches library. The new documentation page (`sql-ref-sketch-aggregates.md`) covers:

**Function Reference:**
- **HyperLogLog (HLL) Sketch Functions**: `hll_sketch_agg`, `hll_union_agg`, `hll_sketch_estimate`, `hll_union`
- **Theta Sketch Functions**: `theta_sketch_agg`, `theta_union_agg`, `theta_intersection_agg`, `theta_sketch_estimate`, `theta_union`, `theta_intersection`, `theta_difference`
- **KLL Quantile Sketch Functions**: `kll_sketch_agg_*`, `kll_sketch_to_string_*`, `kll_sketch_get_n_*`, `kll_sketch_merge_*`, `kll_sketch_get_quantile_*`, `kll_sketch_get_rank_*`
- **Approximate Top-K Functions**: `approx_top_k_accumulate`, `approx_top_k_combine`, `approx_top_k_estimate`

**Best Practices:**
- Guidance on choosing between HLL and Theta sketches
- Accuracy vs. memory trade-offs for each sketch type
- Tips for storing and reusing sketches

**Common Use Cases and Examples:**
- Tracking daily unique users with HLL sketches (ETL workflow)
- Computing percentiles over time with KLL sketches
- Set operations with Theta sketches (intersection, difference for cohort analysis)
- Finding trending items with Top-K sketches

The PR also adds links to this new documentation page from:
- `sql-ref-functions.md` (under Aggregate-like Functions)
- `sql-ref.md` (under Functions section)
- `_data/menu-sql.yaml` (navigation menu)

### Why are the changes needed?
Spark SQL has added several sketch-based approximate functions using the Apache DataSketches library (HLL sketches in 3.5.0, Theta/KLL/Top-K sketches in 4.1.0), but there was no comprehensive documentation explaining:
- How to use these functions together in practical ETL workflows
- How to store sketches and merge them across multiple data batches
- Best practices for choosing the right sketch type and tuning accuracy parameters

This documentation fills that gap and helps users understand the full power of sketch-based analytics in Spark SQL.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds new documentation pages that are user-facing. No code changes are included.

### How was this patch tested?

Documentation-only change. The examples were verified against the existing function implementations and test cases in the codebase.

### Was this patch authored or co-authored using generative AI tooling?

Yes, code assistance with `claude-4.5-opus-high` in combination with manual editing by the author.
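
As an illustration of the "store sketches and merge them across multiple data batches" idea mentioned under "Why are the changes needed?", here is a minimal pure-Python sketch of a mergeable distinct-count summary. It is not Spark's or DataSketches' implementation: it uses a toy K-minimum-values (KMV) approach, and the names `hash64`, `sketch_agg`, `sketch_union`, `sketch_estimate`, and the size `K` are all hypothetical, chosen only to echo the build / merge / estimate pattern of the functions listed above. The user ids are made-up data.

```python
import hashlib

K = 1024  # hypothetical sketch size; larger K means lower error, more memory


def hash64(item):
    """Map an item to a pseudo-uniform 64-bit integer."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch_agg(items, k=K):
    """Build a sketch: the k smallest distinct hash values, sorted."""
    return sorted({hash64(x) for x in items})[:k]


def sketch_union(a, b, k=K):
    """Merge two sketches built with the same k into one sketch."""
    return sorted(set(a) | set(b))[:k]


def sketch_estimate(sketch, k=K):
    """Estimate the distinct count represented by a sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct items seen: exact
    kth_smallest = sketch[-1] / 2**64  # scale the k-th smallest hash to [0, 1)
    return (k - 1) / kth_smallest


# Two daily batches of user ids with some overlap (made-up data).
day1 = sketch_agg(range(0, 6000))
day2 = sketch_agg(range(4000, 10000))

# Merging the stored per-day sketches avoids rescanning the raw data.
merged = sketch_union(day1, day2)
approx_users = sketch_estimate(merged)  # true distinct count is 10000
```

The property that makes this workflow possible is that merging two stored sketches yields the same sketch as building one over the combined raw data, so per-batch sketches can be persisted and combined later. Production sketches offer much better accuracy and memory guarantees than this toy version.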
