Yes, HLL sketches do not support the operations you mention, and this is 
actually a good reason to add other types of sketches.
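
To make the difference concrete, here is a toy K-minimum-values sketch (the core idea behind Theta sketches) in plain Python. This is only an illustration of the principle, not the Apache DataSketches implementation:

```python
# Toy K-minimum-values / Theta-style sketch. Illustrative only: the real
# DataSketches Theta sketch is far more engineered, but the core idea is the
# same and shows why intersections work here and not with HLL.
import hashlib

K = 1024  # number of retained hash values; relative error ~ 1/sqrt(K)

def _h(item):
    """Hash an item to a uniform float in [0, 1)."""
    d = hashlib.sha1(str(item).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2.0**64

def sketch(items):
    """Return (theta, retained): the smallest distinct hashes below theta."""
    hashes = sorted({_h(x) for x in items})
    if len(hashes) <= K:
        return 1.0, hashes               # not full: sketch is exact
    retained = hashes[:K]
    # theta = K-th smallest hash; retain the K-1 values strictly below it
    return retained[-1], retained[:-1]

def estimate(theta, retained):
    """Distinct-count estimate: retained values are a theta-fraction sample."""
    return len(retained) if theta == 1.0 else len(retained) / theta

def intersect(sk1, sk2):
    """Set intersection of two sketches -- the operation HLL cannot do."""
    theta = min(sk1[0], sk2[0])
    retained = sorted(v for v in set(sk1[1]) & set(sk2[1]) if v < theta)
    return theta, retained
```

Because each sketch retains actual hash values below a threshold, intersecting two sketches is just intersecting their retained samples. An HLL register array carries no such sample, so an intersection can only be approximated indirectly via inclusion-exclusion, with much worse error behavior.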

Ryan beat me to answering :) DataSketches is already a dependency, so that should 
make some things easier.

Regarding the user-facing functionality, could you please be more specific about 
what you propose?

There is already "approx_count_distinct", and I am afraid that, for example, 
"approx_count_distinct_sketch" might be misleading or confusing (which sketch?)

- Menelaos


> On Jun 3, 2025, at 3:33 PM, Boumalhab, Chris <cboum...@amazon.com> wrote:
> 
> Hi Menelaos,
> 
> Thanks for pointing that out. HLL sketches do not support set operations such 
> as intersection or difference. Tuple sketches would also allow value 
> aggregation for the same key. For those reasons, I don’t believe HLL is 
> enough.
>  
> Chris
>  
> From: Menelaos Karavelas <menelaos.karave...@gmail.com>
> Date: Tuesday, June 3, 2025 at 6:15 PM
> To: "Boumalhab, Chris" <cboum...@amazon.com.INVALID>
> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> Subject: RE: [EXTERNAL] [DISCUSS] Proposal to Add Theta and Tuple Sketches to 
> Spark SQL
>  
> 
>  
> Hello Chris. 
>  
> HLL sketches from the same project (Apache DataSketches) have already been 
> integrated into Spark.
>  
> How does your proposal fit given what I just mentioned?
>  
> - Menelaos
> 
> 
> On Jun 3, 2025, at 2:52 PM, Boumalhab, Chris <cboum...@amazon.com.INVALID> 
> wrote:
>  
> Hi all,
>  
> I’d like to start a discussion about adding support for [Apache 
> DataSketches](https://datasketches.apache.org/) — specifically, Theta and 
> Tuple Sketches — to Spark SQL and DataFrame APIs.
>  
> ## Motivation
> These sketches allow scalable approximate set operations (distinct count, 
> union, intersection, and set difference) and are well-suited for large-scale 
> analytics. They are already used in production in systems like Druid, Presto, 
> and others.
>  
> Integrating them natively into Spark (e.g., as UDAFs or SQL functions) could 
> offer performance and memory efficiency benefits for use cases such as:
> - Large cardinality distinct counts
> - Approximate aggregations over streaming/batch data
> - Set-based operations across datasets
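>  
> As a sketch of what the SQL surface could look like (the function and table 
> names below are hypothetical placeholders, not a proposed final API):
>  
> ```sql
> -- Estimate the overlap between two user populations:
> SELECT theta_sketch_estimate(theta_intersect(us.sk, eu.sk)) AS overlap
> FROM (SELECT theta_sketch_agg(user_id) AS sk FROM events_us) us,
>      (SELECT theta_sketch_agg(user_id) AS sk FROM events_eu) eu;
> ```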
>  
> ## Proposed Scope
> - Add Theta and Tuple Sketch-based UDAFs to Spark SQL
> - Optional integration into `spark.sql` functions (e.g., 
> `approx_count_distinct_sketch`)
> - Use Apache DataSketches as a dependency (already a top-level Apache 
> project)
> - Start as an optional module if core integration is too heavy
>  
> I’m happy to work on a design doc or POC if there’s interest.
>  
> Thanks,  
> Chris