[
https://issues.apache.org/jira/browse/SPARK-52407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Christopher Boumalhab updated SPARK-52407:
------------------------------------------
Description:
*Add Theta Sketches Support to Spark SQL*
This proposal aims to integrate Apache DataSketches' Theta Sketches into Spark
SQL and the DataFrame APIs. While Spark already includes support for HLL
sketches via {{hll_sketch_*}} functions, Theta Sketches provide additional
capabilities not covered by HLL, such as set operations (intersection,
difference).
*Motivation:*
* Enable scalable and memory-efficient approximate set operations for
large-scale analytics.
* Support high-cardinality distinct counts, approximate aggregations over
batch and streaming data, and set-based operations across datasets.
* Leverage Apache DataSketches, which is a top-level Apache project and
already a dependency within Spark.
*Proposed Features:*
* New aggregate and scalar functions for Theta Sketches, including:
** {{theta_sketch_agg(col)}}: build a Theta sketch from a column
** {{theta_union(sketch1, sketch2)}} and {{theta_union_agg(sketch_col)}}:
union operations
** {{theta_intersection(sketch1, sketch2)}} and
{{theta_intersection_agg(sketch_col)}}: intersection operations
** {{theta_difference(sketch1, sketch2)}}: difference operation
** {{theta_sketch_estimate(sketch)}}: estimate cardinality
* Similar functions to support Tuple Sketches, prioritized after Theta
sketches are integrated.
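To make the intended semantics of the functions above concrete, here is a minimal pure-Python sketch of the underlying idea (keep the smallest hashes, scale the count by the sampling threshold). It is an illustration under stated assumptions, not the Apache DataSketches implementation and not the proposed Spark API: the function names only mirror the proposal, and {{K}}, {{_hash01}}, {{_trim}} and the {{ThetaSketch}} class are hypothetical helpers introduced here.

```python
import hashlib
from dataclasses import dataclass

K = 4096  # assumed nominal number of retained hashes; larger means more accurate


def _hash01(item) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using 53 hash bits."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


@dataclass(frozen=True)
class ThetaSketch:
    theta: float       # sampling threshold in (0, 1]
    hashes: frozenset  # every observed hash strictly below theta


def _trim(hashes, theta, k=K):
    """Keep at most k hashes by lowering theta to the (k+1)-th smallest hash."""
    kept = sorted(h for h in hashes if h < theta)
    if len(kept) > k:
        theta = kept[k]
        kept = kept[:k]
    return ThetaSketch(theta, frozenset(kept))


def theta_sketch_agg(values, k=K):
    """Build a sketch from an iterable of values (stand-in for the aggregate)."""
    return _trim({_hash01(v) for v in values}, 1.0, k)


def theta_union(a, b, k=K):
    """Sketch of A union B: merge hashes under the smaller threshold, then trim."""
    return _trim(a.hashes | b.hashes, min(a.theta, b.theta), k)


def theta_intersection(a, b):
    """Sketch of A intersect B: hashes present in both, under the smaller threshold."""
    theta = min(a.theta, b.theta)
    return ThetaSketch(theta, frozenset(h for h in a.hashes & b.hashes if h < theta))


def theta_difference(a, b):
    """Sketch of A minus B: hashes in A that B did not retain, under the smaller threshold."""
    theta = min(a.theta, b.theta)
    return ThetaSketch(theta, frozenset(h for h in a.hashes - b.hashes if h < theta))


def theta_sketch_estimate(sketch) -> float:
    """Estimated distinct count: retained hashes divided by the sampling rate."""
    return len(sketch.hashes) / sketch.theta
```

While an input has at most {{K}} distinct values, theta stays at 1.0 and the results are exact: for sketches of {{[1, 2, 3, 4]}} and {{[3, 4, 5]}}, the union, intersection and difference estimates are 5.0, 2.0 and 2.0. Beyond that the sketch holds a fixed-size sample, so estimates become approximate. Because every result is itself a sketch, the operations can be chained, which HLL sketches do not offer for intersection or difference.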
*Implementation Notes:*
* Follow naming and design conventions established by the existing
{{hll_sketch_*}} functions.
* Engage with the Apache DataSketches community for technical guidance and
cross-project synergy.
This enhancement would let Spark users run approximate distinct counts and
set operations (union, intersection, difference) on large datasets,
complementing the existing approximate functions.
was:
*Add Theta Sketches Support to Spark SQL*
This proposal aims to integrate Apache DataSketches' Theta Sketches into Spark
SQL and the DataFrame APIs. While Spark already includes support for HLL
sketches via {{hll_sketch_*}} functions, Theta Sketches provide additional
capabilities not covered by HLL, such as set operations (intersection,
difference) and value aggregation by key.
*Motivation:*
* Enable scalable and memory-efficient approximate set operations for
large-scale analytics.
* Support high-cardinality distinct counts, approximate aggregations over
batch and streaming data, and set-based operations across datasets.
* Leverage Apache DataSketches, which is already an Apache incubating project
and a dependency within Spark.
*Proposed Features:*
* New aggregate functions for Theta Sketches, including:
** {{theta_sketch_agg(col)}} — build Theta sketches
** {{theta_union(sketch1, sketch2)}} and {{theta_union_agg(sketch_col)}} —
union operations
** {{theta_intersection(sketch1, sketch2)}} and
{{theta_intersection_agg(sketch_col)}} — intersection operations
** {{theta_difference(sketch1, sketch2)}} — difference operation
** {{theta_sketch_estimate(sketch)}} — estimate cardinality
* Similar functions to support Tuple Sketches, prioritized after Theta
sketches are integrated.
*Implementation Notes:*
* Follow naming and design conventions established by existing HLL sketch UDFs.
* Engage with the Apache DataSketches community for technical guidance and
cross-project synergy.
This enhancement will enable Spark users to perform advanced approximate
analytics with improved performance and scalability, complementing existing
approximate functions.
> Add Native Support for Apache Theta Sketches
> --------------------------------------------
>
> Key: SPARK-52407
> URL: https://issues.apache.org/jira/browse/SPARK-52407
> Project: Spark
> Issue Type: New Feature
> Components: PySpark, SQL
> Affects Versions: 3.4.4
> Reporter: Christopher Boumalhab
> Priority: Minor
> Labels: datasketches
> Original Estimate: 672h
> Remaining Estimate: 672h