Hello Chris. HLL sketches from the same project (Apache DataSketches) have already been integrated into Spark.
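For reference, the functions added in Spark 3.5 are `hll_sketch_agg`, `hll_union_agg`, and `hll_sketch_estimate`. A minimal example is below; the table and column names are only illustrative:

```scala
// Existing DataSketches HLL integration in Spark SQL (3.5+): build per-group
// sketches, merge them, and estimate the overall distinct count.
// The events / user_id / day names are illustrative, not from the proposal.
spark.sql("""
  SELECT hll_sketch_estimate(hll_union_agg(sketch)) AS approx_distinct_users
  FROM (
    SELECT day, hll_sketch_agg(user_id) AS sketch
    FROM events
    GROUP BY day
  ) AS daily
""").show()
```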
How does your proposal fit, given what I just mentioned?

- Menelaos

> On Jun 3, 2025, at 2:52 PM, Boumalhab, Chris <cboum...@amazon.com.INVALID> wrote:
>
> Hi all,
>
> I’d like to start a discussion about adding support for [Apache DataSketches](https://datasketches.apache.org/), specifically Theta and Tuple Sketches, to the Spark SQL and DataFrame APIs.
>
> ## Motivation
>
> These sketches enable scalable approximate set operations (distinct counts, unions, intersections, and set difference) and are well suited to large-scale analytics. They are already used in production in systems such as Druid and Presto.
>
> Integrating them natively into Spark (e.g., as UDAFs or SQL functions) could offer performance and memory-efficiency benefits for use cases such as:
>
> - High-cardinality distinct counts
> - Approximate aggregations over streaming and batch data
> - Set-based operations across datasets
>
> ## Proposed Scope
>
> - Add Theta and Tuple Sketch-based UDAFs to Spark SQL
> - Optional integration into `spark.sql` functions (e.g., `approx_count_distinct_sketch`)
> - Use Apache DataSketches as a dependency (now a top-level Apache project)
> - Start as an optional module if core integration is too heavy
>
> I’m happy to work on a design doc or POC if there’s interest.
>
> Thanks,
> Chris
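To make the comparison with the existing HLL functions concrete: what Theta sketches add is set algebra beyond union, in particular intersections and set difference. Below is a minimal standalone sketch of the underlying datasketches-java Theta API (recent versions) that such UDAFs would wrap; it is only illustrative, not the proposed Spark integration itself.

```scala
// Illustrative standalone use of the DataSketches Theta API
// (org.apache.datasketches.theta, datasketches-java). Data is synthetic.
import org.apache.datasketches.theta.{SetOperation, UpdateSketch}

val a = UpdateSketch.builder().build()
val b = UpdateSketch.builder().build()
(1 to 100000).foreach(i => a.update(s"user-$i"))
(50000 to 150000).foreach(i => b.update(s"user-$i"))

// Approximate distinct count of a single set.
println(f"|A| ~= ${a.getEstimate}%.0f")

// Union is also available with HLL; intersection (and A-not-B) is what the
// Theta/Tuple proposal would add on top of the existing hll_* functions.
val union = SetOperation.builder().buildUnion()
union.union(a)
union.union(b)
println(f"|A union B| ~= ${union.getResult.getEstimate}%.0f")

val intersection = SetOperation.builder().buildIntersection()
intersection.intersect(a)
intersection.intersect(b)
println(f"|A intersect B| ~= ${intersection.getResult.getEstimate}%.0f")
```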