Hi Chris,

We integrated DataSketches into Spark when we introduced the hll_sketch_* UDFs - see the PR from 2023 <https://github.com/apache/spark/pull/40615> for more info. I'm sure there'd be interest in exposing other types of sketches, and I bet there'd be some potential for code reuse between the various sketch implementations!
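For reference, the HLL functions from that PR are mergeable aggregates, which is the property a Theta/Tuple integration would share. A quick usage sketch (the `events`/`user_id` and `daily_user_sketches`/`daily_sketch` names below are placeholders, not real tables):

```sql
-- Build an HLL sketch over a column, then estimate the distinct count:
SELECT hll_sketch_estimate(hll_sketch_agg(user_id)) AS approx_distinct_users
FROM events;

-- Because sketches are mergeable, pre-aggregated per-partition sketches
-- can be unioned later without rescanning the raw data:
SELECT hll_sketch_estimate(hll_union_agg(daily_sketch)) AS approx_distinct_users
FROM daily_user_sketches;
```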
Ryan Berti
Senior Data Engineer | Content & Studio DE
M 7023217573
5808 W Sunset Blvd | Los Angeles, CA 90028

On Tue, Jun 3, 2025 at 2:53 PM Boumalhab, Chris <cboum...@amazon.com.invalid> wrote:

> Hi all,
>
> I’d like to start a discussion about adding support for [Apache
> DataSketches](https://datasketches.apache.org/) — specifically, Theta and
> Tuple Sketches — to Spark SQL and DataFrame APIs.
>
> ## Motivation
>
> These sketches allow scalable approximate set operations (like distinct
> count, unions, intersections, minus) and are well-suited for large-scale
> analytics. They are already used in production in systems like Druid,
> Presto, and others.
>
> Integrating them natively into Spark (e.g., as UDAFs or SQL functions)
> could offer performance and memory efficiency benefits for use cases such
> as:
>
> - Large cardinality distinct counts
> - Approximate aggregations over streaming/batch data
> - Set-based operations across datasets
>
> ## Proposed Scope
>
> - Add Theta and Tuple Sketch-based UDAFs to Spark SQL
> - Optional integration into `spark.sql` functions (e.g.,
>   `approx_count_distinct_sketch`)
> - Use Apache DataSketches as a dependency (already an incubating Apache
>   project)
> - Start as an optional module if core integration is too heavy
>
> I’m happy to work on a design doc or POC if there’s interest.
>
> Thanks,
> Chris
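As background for anyone on the list unfamiliar with how Theta-style sketches support cheap unions: a minimal toy k-minimum-values (KMV) estimator - the idea Theta sketches generalize, not the actual DataSketches implementation - looks roughly like this:

```python
import hashlib


def _uniform_hash(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) via MD5."""
    digest = hashlib.md5(item.encode()).hexdigest()
    return int(digest[:16], 16) / 16**16


class KMVSketch:
    """Toy k-minimum-values sketch: keep the k smallest hash values seen.

    If distinct items hash uniformly into [0, 1), the k-th smallest hash
    is ~ k / n for n distinct items, so (k - 1) / kth_min estimates n.
    """

    def __init__(self, k: int = 256):
        self.k = k
        self.mins = set()  # at most k smallest hash values

    def update(self, item) -> None:
        self.mins.add(_uniform_hash(str(item)))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))  # evict the largest

    def union(self, other: "KMVSketch") -> "KMVSketch":
        # Unions compose without touching raw data: merge the two
        # candidate sets and keep the k smallest values again.
        out = KMVSketch(min(self.k, other.k))
        for value in self.mins | other.mins:
            out.mins.add(value)
            if len(out.mins) > out.k:
                out.mins.remove(max(out.mins))
        return out

    def estimate(self) -> float:
        if len(self.mins) < self.k:
            return float(len(self.mins))  # exact in the small regime
        return (self.k - 1) / max(self.mins)
```

This is why sketch aggregates fit Spark's partial-aggregation model so well: each partition builds a fixed-size sketch, and the merge step is a cheap `union` rather than a shuffle of raw values.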