[jira] [Updated] (SPARK-52407) Add Native Support for Apache Theta Sketches

Christopher Boumalhab (Jira) Tue, 15 Jul 2025 07:43:26 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-52407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Christopher Boumalhab updated SPARK-52407:
------------------------------------------
    Description: 
*Add Theta Sketches Support to Spark SQL*

This proposal aims to integrate Apache DataSketches' Theta Sketches into Spark 
SQL and the DataFrame APIs. While Spark already includes support for HLL 
sketches via {{hll_sketch_*}} functions, Theta Sketches provide additional 
capabilities not covered by HLL, such as set operations (intersection, 
difference).

*Motivation:*
 * Enable scalable and memory-efficient approximate set operations for 
large-scale analytics.

 * Support high-cardinality distinct counts, approximate aggregations over 
batch and streaming data, and set-based operations across datasets.

 * Leverage Apache DataSketches, which is an Apache top-level project and 
already a dependency within Spark.

*Proposed Features:*
 * New aggregate and scalar functions for Theta Sketches, including:

 ** {{theta_sketch_agg(col)}}: build Theta sketches

 ** {{theta_union(sketch1, sketch2)}} and {{theta_union_agg(sketch_col)}}: 
union operations

 ** {{theta_intersection(sketch1, sketch2)}} and 
{{theta_intersection_agg(sketch_col)}}: intersection operations

 ** {{theta_difference(sketch1, sketch2)}}: difference operation

 ** {{theta_sketch_estimate(sketch)}}: estimate cardinality

 * Similar functions to support Tuple Sketches, prioritized after Theta 
sketches are integrated.
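
To make the intended semantics of the functions listed above concrete, here is a minimal pure-Python sketch of a Theta-style (k-minimum-values) sketch. It is an illustration only and makes several assumptions: the names {{ToyThetaSketch}}, {{sketch_agg}}, {{union}}, {{intersection}}, {{difference}} and {{union_agg}} are hypothetical stand-ins for the proposed SQL functions, SHA-256 is a stand-in hash, and none of this reflects the actual DataSketches implementation or its serialized format.

```python
import hashlib
from functools import reduce

HASH_SPACE = 2 ** 64  # hashes are treated as integers in [0, 2**64)


def _hash64(value):
    """Map a value to a stable 64-bit integer (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyThetaSketch:
    """Hypothetical toy sketch: keeps at most k hashes below a threshold theta."""

    def __init__(self, k=4096, theta=HASH_SPACE, hashes=()):
        self.k = k
        self.theta = theta
        self.hashes = {h for h in hashes if h < theta}
        self._trim()

    def _trim(self):
        # If more than k hashes are retained, lower theta to the (k+1)-th
        # smallest hash and keep only the k hashes below it.
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[: self.k])

    def update(self, value):
        h = _hash64(value)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def estimate(self):
        # Retained count scaled by the inverse sampling fraction theta / 2**64.
        return len(self.hashes) * HASH_SPACE / self.theta


def sketch_agg(values, k=4096):
    """Stand-in for theta_sketch_agg(col): build one sketch from many values."""
    sketch = ToyThetaSketch(k)
    for value in values:
        sketch.update(value)
    return sketch


def union(a, b):
    """Stand-in for theta_union: hashes from either input, below the smaller theta."""
    return ToyThetaSketch(min(a.k, b.k), min(a.theta, b.theta), a.hashes | b.hashes)


def intersection(a, b):
    """Stand-in for theta_intersection: hashes present in both inputs."""
    return ToyThetaSketch(min(a.k, b.k), min(a.theta, b.theta), a.hashes & b.hashes)


def difference(a, b):
    """Stand-in for theta_difference: hashes in a that are absent from b."""
    return ToyThetaSketch(min(a.k, b.k), min(a.theta, b.theta), a.hashes - b.hashes)


def union_agg(sketches):
    """Stand-in for theta_union_agg(sketch_col): fold union over many sketches."""
    return reduce(union, sketches)


a = sketch_agg(range(0, 1000))
b = sketch_agg(range(500, 1500))
# With fewer than k distinct values per sketch, theta stays at 2**64 and the
# estimates are exact counts of the retained hashes.
print(union(a, b).estimate())         # 1500.0
print(intersection(a, b).estimate())  # 500.0
print(difference(a, b).estimate())    # 500.0
```

Because every operation returns another sketch, results can be chained (for example, intersecting a union with a third sketch) before a single final estimate. That composability is what the proposed scalar and {{_agg}} variants are meant to expose in SQL.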

*Implementation Notes:*
 * Follow naming and design conventions established by the existing built-in 
HLL sketch functions.

 * Engage with the Apache DataSketches community for technical guidance and 
cross-project synergy.

This enhancement will enable Spark users to perform advanced approximate 
analytics with improved performance and scalability, complementing existing 
approximate functions.

  was:
*Add Theta Sketches Support to Spark SQL*

This proposal aims to integrate Apache DataSketches' Theta Sketches into Spark 
SQL and the DataFrame APIs. While Spark already includes support for HLL 
sketches via {{hll_sketch_*}} functions, Theta Sketches provide additional 
capabilities not covered by HLL, such as set operations (intersection, 
difference) and value aggregation by key.

*Motivation:*
 * Enable scalable and memory-efficient approximate set operations for 
large-scale analytics.

 * Support high-cardinality distinct counts, approximate aggregations over 
batch and streaming data, and set-based operations across datasets.

 * Leverage Apache DataSketches, which is already an Apache incubating project 
and a dependency within Spark.

*Proposed Features:*
 * New aggregate functions for Theta Sketches, including:

 ** {{theta_sketch_agg(col)}} — build Theta sketches

 ** {{theta_union(sketch1, sketch2)}} and {{theta_union_agg(sketch_col)}} — 
union operations

 ** {{theta_intersection(sketch1, sketch2)}} and 
{{theta_intersection_agg(sketch_col)}} — intersection operations

 ** {{theta_difference(sketch1, sketch2)}} — difference operation

 ** {{theta_sketch_estimate(sketch)}} — estimate cardinality

 * Similar functions to support Tuple Sketches, prioritized after Theta 
sketches are integrated.

*Implementation Notes:*
 * Follow naming and design conventions established by existing HLL sketch UDFs.

 * Engage with the Apache DataSketches community for technical guidance and 
cross-project synergy.

This enhancement will enable Spark users to perform advanced approximate 
analytics with improved performance and scalability, complementing existing 
approximate functions.


> Add Native Support for Apache Theta Sketches
> --------------------------------------------
>
>                 Key: SPARK-52407
>                 URL: https://issues.apache.org/jira/browse/SPARK-52407
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, SQL
>    Affects Versions: 3.4.4
>            Reporter: Christopher Boumalhab
>            Priority: Minor
>              Labels: datasketches
>   Original Estimate: 672h
>  Remaining Estimate: 672h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
