[ https://issues.apache.org/jira/browse/SPARK-52407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christopher Boumalhab updated SPARK-52407:
------------------------------------------
    Shepherd: Daniel Tenedorio  (was: Menelaos Karavelas)

> Add Native Support for Apache Theta Sketches
> --------------------------------------------
>
>                 Key: SPARK-52407
>                 URL: https://issues.apache.org/jira/browse/SPARK-52407
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, SQL
>    Affects Versions: 4.1.0
>            Reporter: Christopher Boumalhab
>            Priority: Minor
>              Labels: datasketches, pull-request-available
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> *Add Theta Sketches Support to Spark SQL*
> This proposal aims to integrate Apache DataSketches' Theta Sketches into
> Spark SQL and the DataFrame APIs. While Spark already includes support for
> HLL sketches via the {{hll_sketch_*}} functions, Theta Sketches provide
> additional capabilities not covered by HLL, such as set operations
> (intersection, difference).
> *Motivation:*
> * Enable scalable and memory-efficient approximate set operations for
> large-scale analytics.
> * Support high-cardinality distinct counts, approximate aggregations over
> batch and streaming data, and set-based operations across datasets.
> * Leverage Apache DataSketches, which is already an Apache project and a
> dependency within Spark.
> *Proposed Features:*
> * New functions for Theta Sketches, including:
> ** {{theta_sketch_agg(col)}}: build Theta sketches
> ** {{theta_union(sketch1, sketch2)}} and {{theta_union_agg(sketch_col)}}:
> union operations
> ** {{theta_intersection(sketch1, sketch2)}} and
> {{theta_intersection_agg(sketch_col)}}: intersection operations
> ** {{theta_difference(sketch1, sketch2)}}: difference operation
> ** {{theta_sketch_estimate(sketch)}}: estimate cardinality
> * Similar functions to support Tuple Sketches, prioritized after Theta
> sketches are integrated.
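To make the set-operation semantics of the proposed functions concrete, here is a minimal pure-Python sketch of the idea behind a Theta sketch: keep only the hashes that fall below a threshold theta, and scale the retained count by the sampled fraction of the hash space. This is a toy model written for illustration, not the Apache DataSketches implementation and not the Spark API. The class name `ToyThetaSketch`, the `hash64` helper, the SHA-256 based hash, and the default `k` are all assumptions; the method names only loosely mirror the proposed function names.

```python
import hashlib

HASH_SPACE = 2 ** 64  # hashes are treated as integers in [0, 2**64)


def hash64(item) -> int:
    """Map an item to a 64-bit integer (SHA-256 prefix; an illustrative choice)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyThetaSketch:
    """Hypothetical toy model: retains at most k hashes, all below theta."""

    def __init__(self, hashes=(), theta=HASH_SPACE, k=4096):
        self.k = k
        self.theta = theta
        self.hashes = {h for h in hashes if h < theta}
        if len(self.hashes) > k:
            # Lower theta to the (k+1)-th smallest hash; keep the k below it.
            ordered = sorted(self.hashes)
            self.theta = ordered[k]
            self.hashes = set(ordered[:k])

    @classmethod
    def from_items(cls, items, k=4096):
        # Rough analogue of theta_sketch_agg(col).
        return cls((hash64(x) for x in items), k=k)

    def estimate(self) -> float:
        # Rough analogue of theta_sketch_estimate(sketch):
        # retained count divided by the fraction of hash space sampled.
        return len(self.hashes) / (self.theta / HASH_SPACE)

    def _combine(self, other, hashes):
        # Every set operation samples at the smaller of the two thresholds.
        theta = min(self.theta, other.theta)
        return ToyThetaSketch(hashes, theta, min(self.k, other.k))

    def union(self, other):
        return self._combine(other, self.hashes | other.hashes)

    def intersection(self, other):
        return self._combine(other, self.hashes & other.hashes)

    def difference(self, other):
        # Items in self that are not in other.
        return self._combine(other, self.hashes - other.hashes)


a = ToyThetaSketch.from_items(range(0, 1000))
b = ToyThetaSketch.from_items(range(500, 1500))
print(a.union(b).estimate())
print(a.intersection(b).estimate())
print(a.difference(b).estimate())
```

Because both inputs hold fewer than `k` distinct items, the toy sketch never lowers theta and the three printed values are exact: 1500.0, 500.0 and 500.0. Once an input exceeds `k` distinct items, theta drops and the results become estimates, which is the trade-off that lets such sketches stay small on high-cardinality data.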
> *Implementation Notes:*
> * Follow naming and design conventions established by existing HLL sketch
> UDFs.
> * Engage with the Apache DataSketches community for technical guidance and
> cross-project synergy.
> This enhancement will enable Spark users to perform advanced approximate
> analytics with improved performance and scalability, complementing existing
> approximate functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)