Hi Ryan,

Thanks for the reply! Would you recommend that I file a JIRA ticket and consider 
developing this? I’m not familiar with the process.

Chris

From: Ryan Berti <rbe...@netflix.com.INVALID>
Date: Tuesday, June 3, 2025 at 6:13 PM
To: "cboum...@amazon.com.invalid" <cboum...@amazon.com.invalid>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: RE: [EXTERNAL] [DISCUSS] Proposal to Add Theta and Tuple Sketches to 
Spark SQL



Hi Chris,

We integrated DataSketches into Spark when we introduced the hll_sketch_* UDFs 
- see the PR from 2023 (https://github.com/apache/spark/pull/40615) for more 
info. I'm sure there'd be interest in exposing other types of sketches, and I 
bet there'd be some potential for code-reuse between the various sketch 
implementations!


Ryan Berti

Senior Data Engineer  |  Content & Studio DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028



On Tue, Jun 3, 2025 at 2:53 PM Boumalhab, Chris <cboum...@amazon.com.invalid> 
wrote:
Hi all,

I’d like to start a discussion about adding support for [Apache 
DataSketches](https://datasketches.apache.org/) — specifically, Theta and Tuple 
Sketches — to Spark SQL and DataFrame APIs.

## Motivation
These sketches support scalable approximate set operations (distinct counts, 
unions, intersections, and set differences) and are well-suited for large-scale analytics. 
They are already used in production in systems like Druid, Presto, and others.

Integrating them natively into Spark (e.g., as UDAFs or SQL functions) could 
offer performance and memory efficiency benefits for use cases such as:
- Large cardinality distinct counts
- Approximate aggregations over streaming/batch data
- Set-based operations across datasets
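
To make the set operations above concrete, here is a toy, pure-Python sketch of the idea behind a Theta sketch: keep only the `k` smallest hash values and scale the retained count by the fraction of the hash space they cover. This is a simplified K-Minimum-Values illustration, not the DataSketches implementation and not a proposed Spark API; `ToyThetaSketch`, `hash64`, and the default `k` are hypothetical names and values chosen for the example.

```python
import hashlib

MAX_HASH = 2 ** 64  # hash values are integers in [0, MAX_HASH)


def hash64(item):
    """Map an item to a 64-bit integer (assumed to be roughly uniform)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyThetaSketch:
    """Illustrative only: retains at most k hashes, all strictly below theta."""

    def __init__(self, k=2048, theta=MAX_HASH, hashes=()):
        self.k = k
        self.theta = theta          # upper bound of the retained hash range
        self.hashes = set(hashes)
        self._trim()

    def _trim(self):
        # Evict the largest hashes until k remain; theta shrinks to the last evicted one.
        while len(self.hashes) > self.k:
            largest = max(self.hashes)
            self.hashes.remove(largest)
            self.theta = largest

    def update(self, item):
        h = hash64(item)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def estimate(self):
        # Retained count divided by the sampled fraction of the hash space.
        return len(self.hashes) * MAX_HASH / self.theta

    def _combine(self, other, keep):
        # Both inputs are complete below the smaller theta, so compare them there.
        theta = min(self.theta, other.theta)
        hashes = {h for h in self.hashes | other.hashes if h < theta and keep(h)}
        return ToyThetaSketch(self.k, theta, hashes)

    def union(self, other):
        return self._combine(other, lambda h: True)

    def intersection(self, other):
        return self._combine(other, lambda h: h in self.hashes and h in other.hashes)

    def difference(self, other):
        return self._combine(other, lambda h: h in self.hashes and h not in other.hashes)


if __name__ == "__main__":
    a, b = ToyThetaSketch(), ToyThetaSketch()
    for i in range(20_000):
        a.update(f"user-{i}")
    for i in range(10_000, 30_000):
        b.update(f"user-{i}")
    print("union        ~", round(a.union(b).estimate()))
    print("intersection ~", round(a.intersection(b).estimate()))
    print("difference   ~", round(a.difference(b).estimate()))
```

The point of the example is that each sketch stays bounded at `k` entries regardless of input size, and the set operations work on the sketches alone, without revisiting the raw data.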

## Proposed Scope
- Add Theta and Tuple Sketch-based UDAFs to Spark SQL
- Optional integration into `spark.sql` functions (e.g., 
`approx_count_distinct_sketch`)
- Use Apache DataSketches as a dependency (already a top-level Apache project)
- Start as an optional module if core integration is too heavy
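
For readers less familiar with Tuple sketches: they extend the Theta idea by attaching a summary value to each retained key, so an aggregate over distinct keys can be estimated alongside the distinct count. The following is a rough, self-contained Python illustration under that assumption; `ToyTupleSketch`, `hash64`, and the choice of a running sum as the summary are hypothetical and do not reflect the DataSketches API or any proposed Spark function.

```python
import hashlib

MAX_HASH = 2 ** 64  # hash values are integers in [0, MAX_HASH)


def hash64(item):
    """Map a key to a 64-bit integer (assumed to be roughly uniform)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyTupleSketch:
    """Illustrative only: keeps a running sum for each of the k smallest key hashes."""

    def __init__(self, k=2048):
        self.k = k
        self.theta = MAX_HASH
        self.summaries = {}  # key hash -> running sum of values seen for that key

    def update(self, key, value):
        h = hash64(key)
        if h >= self.theta:
            return  # key falls outside the sampled part of the hash space
        self.summaries[h] = self.summaries.get(h, 0.0) + value
        if len(self.summaries) > self.k:
            # Evict the key with the largest hash; theta shrinks to that hash.
            largest = max(self.summaries)
            del self.summaries[largest]
            self.theta = largest

    def _scale(self):
        return MAX_HASH / self.theta

    def estimate_distinct_keys(self):
        return len(self.summaries) * self._scale()

    def estimate_total(self):
        # Sum over the sampled keys, scaled up to the full key population.
        return sum(self.summaries.values()) * self._scale()


if __name__ == "__main__":
    sketch = ToyTupleSketch()
    for event in range(3):
        for user in range(20_000):
            sketch.update(f"user-{user}", 1.0)
    print("distinct users ~", round(sketch.estimate_distinct_keys()))
    print("total events   ~", round(sketch.estimate_total()))
```

Because theta only ever decreases, a key that is still retained has had every one of its updates counted, which is what lets the per-key sums be scaled up to an estimate for the whole population.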

I’m happy to work on a design doc or POC if there’s interest.

Thanks,
Chris
