Yes, HLL sketches do not support the operations you mention, and this is actually a good reason to add other types of sketches.
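To make the difference concrete, here is a toy, self-contained KMV-style sketch in Python. It is purely illustrative (the class name, parameters, and estimator are my own simplification, not the Apache DataSketches implementation), but it shows why Theta-style sketches compose under union and intersection, operations HLL does not support directly:

```python
import hashlib

def _hash01(x):
    # Map a value to a pseudo-uniform float in [0, 1).
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class ToyThetaSketch:
    """Illustrative KMV-style theta sketch: retains hashes below theta."""

    def __init__(self, k=256, entries=None, theta=1.0):
        self.k = k
        self.theta = theta
        self.entries = set(entries or ())
        self._trim()

    def _trim(self):
        if len(self.entries) > self.k:
            kept = sorted(self.entries)[: self.k]
            # The k-th smallest hash becomes the new threshold theta;
            # everything at or above it is dropped.
            self.theta = kept[-1]
            self.entries = {h for h in kept if h < self.theta}

    def update(self, value):
        h = _hash01(value)
        if h < self.theta:
            self.entries.add(h)
            self._trim()

    def estimate(self):
        # Each retained hash stands for roughly 1/theta distinct items.
        return len(self.entries) / self.theta

    def union(self, other):
        theta = min(self.theta, other.theta)
        ents = {h for h in self.entries | other.entries if h < theta}
        return ToyThetaSketch(self.k, ents, theta)

    def intersect(self, other):
        # The key capability HLL lacks: intersect two sketches directly.
        theta = min(self.theta, other.theta)
        ents = {h for h in self.entries & other.entries if h < theta}
        return ToyThetaSketch(self.k, ents, theta)
```

As in the real library, the relative error scales roughly with 1/sqrt(k), and intersections of sketches with small overlap carry noticeably higher error than unions.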
Ryan beat me to answering :) DataSketches is already a dependency, so it should make some things easier. Regarding the user-facing functionality, could you please be more specific as to what you propose? There is already "approx_count_distinct", and I am afraid that, for example, "approx_count_distinct_sketch" might be misleading or confusing (which sketch?).

- Menelaos

> On Jun 3, 2025, at 3:33 PM, Boumalhab, Chris <cboum...@amazon.com> wrote:
>
> Hi Menelaos,
>
> Thanks for pointing that out. HLL sketches do not support set operations
> such as intersection or difference. Tuple sketches would also allow value
> aggregation for the same key. For those reasons, I don’t believe HLL is
> enough.
>
> Chris
>
> From: Menelaos Karavelas <menelaos.karave...@gmail.com>
> Date: Tuesday, June 3, 2025 at 6:15 PM
> To: "Boumalhab, Chris" <cboum...@amazon.com.INVALID>
> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> Subject: RE: [EXTERNAL] [DISCUSS] Proposal to Add Theta and Tuple Sketches
> to Spark SQL
>
> Hello Chris.
>
> HLL sketches from the same project (Apache DataSketches) have already been
> integrated in Spark.
>
> How does your proposal fit given what I just mentioned?
>
> - Menelaos
>
> On Jun 3, 2025, at 2:52 PM, Boumalhab, Chris <cboum...@amazon.com.INVALID>
> wrote:
>
> Hi all,
>
> I’d like to start a discussion about adding support for
> [Apache DataSketches](https://datasketches.apache.org/) — specifically,
> Theta and Tuple Sketches — to Spark SQL and DataFrame APIs.
>
> ## Motivation
> These sketches support scalable approximate set operations (distinct
> count, union, intersection, and set difference) and are well suited for
> large-scale analytics. They are already used in production in systems such
> as Druid and Presto.
>
> Integrating them natively into Spark (e.g., as UDAFs or SQL functions)
> could offer performance and memory-efficiency benefits for use cases such
> as:
> - Large-cardinality distinct counts
> - Approximate aggregations over streaming/batch data
> - Set-based operations across datasets
>
> ## Proposed Scope
> - Add Theta and Tuple Sketch-based UDAFs to Spark SQL
> - Optional integration into `spark.sql` functions (e.g.,
>   `approx_count_distinct_sketch`)
> - Use Apache DataSketches as a dependency (already a top-level Apache
>   project)
> - Start as an optional module if core integration is too heavy
>
> I’m happy to work on a design doc or POC if there’s interest.
>
> Thanks,
> Chris
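To illustrate the value-aggregation point raised earlier in the thread, here is a toy tuple-sketch-like structure in Python. The class and method names are hypothetical simplifications, not the DataSketches API; it only shows how such a sketch can retain a per-key summary (here a running sum) alongside the sampled key hashes, which a pure distinct-count sketch like HLL cannot do:

```python
import hashlib

def _hash01(x):
    # Map a key to a pseudo-uniform float in [0, 1).
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class ToyTupleSketch:
    """Illustrative tuple sketch: each retained key carries a summed value."""

    def __init__(self, k=256):
        self.k = k
        self.theta = 1.0
        self.entries = {}  # key hash -> aggregated value for that key

    def update(self, key, value):
        h = _hash01(key)
        if h >= self.theta:
            return  # key falls outside the current sampling threshold
        # Aggregate values for the same key into one summary.
        self.entries[h] = self.entries.get(h, 0) + value
        if len(self.entries) > self.k:
            kept = sorted(self.entries)[: self.k]
            self.theta = kept[-1]  # new cutoff; drop hashes at or above it
            self.entries = {h2: v for h2, v in self.entries.items()
                            if h2 < self.theta}

    def estimate_distinct_keys(self):
        # Each retained hash stands for roughly 1/theta distinct keys.
        return len(self.entries) / self.theta

    def estimate_total_value(self):
        # Retained sums are a uniform sample of keys; scale up by 1/theta.
        return sum(self.entries.values()) / self.theta
```

In a Spark UDAF along the lines the proposal describes, `update` would run per row within a partition and the per-partition sketches would then be merged in the aggregation buffer, which is what makes these sketches attractive for distributed execution.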