Hello Chris.

HLL sketches from the same project (Apache DataSketches) have already been 
integrated into Spark.
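
For reference, those functions landed in Spark 3.5 as hll_sketch_agg, 
hll_sketch_estimate, hll_union and hll_union_agg. A minimal illustration in the 
DataFrame API (the `events` DataFrame and its `dt`/`user_id` columns are 
placeholders for the example):

```scala
import org.apache.spark.sql.functions.{col, hll_sketch_agg, hll_sketch_estimate}

// One binary HLL sketch per day; "events", "dt" and "user_id" are placeholder names.
val perDay = events
  .groupBy("dt")
  .agg(hll_sketch_agg(col("user_id")).as("user_hll"))

// Turn each day's sketch back into an approximate distinct count.
perDay
  .select(col("dt"), hll_sketch_estimate(col("user_hll")).as("approx_users"))
  .show()
```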

Given that existing integration, how does your proposal fit in?

- Menelaos

> On Jun 3, 2025, at 2:52 PM, Boumalhab, Chris <cboum...@amazon.com.INVALID> 
> wrote:
> 
> Hi all,
>  
> I’d like to start a discussion about adding support for [Apache 
> DataSketches](https://datasketches.apache.org/) — specifically, Theta and 
> Tuple Sketches — to Spark SQL and DataFrame APIs.
>  
> ## Motivation
> These sketches support scalable approximate distinct counting and set 
> operations (union, intersection, and set difference) and are well suited for 
> large-scale analytics. They are already used in production in systems such as 
> Druid, Presto, and others.
>  
> Integrating them natively into Spark (e.g., as UDAFs or SQL functions) could 
> offer performance and memory efficiency benefits for use cases such as:
> - Large-cardinality distinct counts
> - Approximate aggregations over streaming/batch data
> - Set-based operations across datasets (a rough library-level example follows this list)
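> 
> As a rough, library-level illustration of the set operations above, here is 
> what a Theta union looks like with the DataSketches Java API (datasketches-java 
> 3.x+ assumed; the ID ranges are synthetic):
> 
> ```scala
> import org.apache.datasketches.theta.{SetOperation, UpdateSketch}
> 
> // Two Theta sketches over overlapping synthetic ID ranges.
> val skA = UpdateSketch.builder().build()
> val skB = UpdateSketch.builder().build()
> (1L to 100000L).foreach(i => skA.update(i))
> (50001L to 150000L).foreach(i => skB.update(i))
> 
> // Union them; intersection and A-not-B come from the same SetOperation builder.
> val union = SetOperation.builder().buildUnion()
> union.union(skA.compact())
> union.union(skB.compact())
> 
> // Roughly 150,000 distinct IDs, within the sketch's error bounds.
> println(union.getResult.getEstimate)
> ```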
>  
> ## Proposed Scope
> - Add Theta and Tuple Sketch-based UDAFs to Spark SQL
> - Optional integration into `spark.sql` functions (e.g., 
> `approx_count_distinct_sketch`); a hypothetical usage sketch follows this list
> - Use Apache DataSketches as a dependency (already a top-level Apache 
> project)
> - Start as an optional module if core integration is too heavy
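> 
> To make the SQL surface concrete, here is a purely hypothetical usage sketch; 
> none of these functions exist in Spark today, and the names (theta_sketch_agg, 
> theta_union_agg, theta_sketch_estimate) simply mirror the existing hll_* family:
> 
> ```scala
> // Hypothetical API shape only; function names are placeholders mirroring hll_*.
> spark.sql("""
>   SELECT region, theta_sketch_agg(user_id) AS users_sketch
>   FROM events
>   GROUP BY region
> """).createOrReplaceTempView("region_sketches")
> 
> // Merge the per-region sketches and read off one approximate distinct count.
> spark.sql("""
>   SELECT theta_sketch_estimate(theta_union_agg(users_sketch)) AS approx_total_users
>   FROM region_sketches
> """).show()
> ```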
>  
> I’m happy to work on a design doc or POC if there’s interest.
>  
> Thanks,  
> Chris
