Hi Chris,

We integrated DataSketches into Spark when we introduced the hll_sketch_*
UDFs - see the 2023 PR <https://github.com/apache/spark/pull/40615>
for more info. I'm sure there'd be interest in exposing other types of
sketches, and I bet there's potential for code reuse between the
various sketch implementations!
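
For context, the HLL sketch functions from that PR are used roughly like
this (Spark 3.5+ SQL; the table and column names are illustrative):

```sql
-- Build an HLL sketch over a column, then estimate the distinct count
SELECT hll_sketch_estimate(hll_sketch_agg(user_id)) AS approx_users
FROM events;

-- Sketches can also be materialized per group and merged later
SELECT hll_sketch_estimate(hll_union_agg(daily_sketch)) AS approx_users_all_days
FROM (
  SELECT day, hll_sketch_agg(user_id) AS daily_sketch
  FROM events
  GROUP BY day
);
```

A Theta/Tuple integration could likely follow the same agg / merge /
estimate pattern.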

Ryan Berti

Senior Data Engineer  |  Content & Studio DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028



On Tue, Jun 3, 2025 at 2:53 PM Boumalhab, Chris <cboum...@amazon.com.invalid>
wrote:

> Hi all,
>
>
>
> I’d like to start a discussion about adding support for [Apache
> DataSketches](https://datasketches.apache.org/) — specifically, Theta and
> Tuple Sketches — to Spark SQL and DataFrame APIs.
>
>
>
> ## Motivation
>
> These sketches enable scalable approximate set operations (distinct
> counts, unions, intersections, and set difference) and are well suited
> for large-scale analytics. They are already used in production in
> systems such as Druid, Presto, and others.
>
>
>
> Integrating them natively into Spark (e.g., as UDAFs or SQL functions)
> could offer performance and memory efficiency benefits for use cases such
> as:
>
> - Large cardinality distinct counts
>
> - Approximate aggregations over streaming/batch data
>
> - Set-based operations across datasets
>
>
>
> ## Proposed Scope
>
> - Add Theta and Tuple Sketch-based UDAFs to Spark SQL
>
> - Optional integration into `spark.sql` functions (e.g.,
> `approx_count_distinct_sketch`)
>
> - Use Apache DataSketches as a dependency (a top-level Apache project
> since 2021)
>
> - Start as an optional module if core integration is too heavy
>
>
>
> I’m happy to work on a design doc or POC if there’s interest.
>
>
>
> Thanks,
>
> Chris
>
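
If this moves forward, the Theta functions could plausibly mirror the
existing HLL surface. A hypothetical sketch of what that might look like
(all function names below are placeholders, not an agreed API):

```sql
-- Hypothetical Theta-sketch aggregate and estimator, mirroring
-- hll_sketch_agg / hll_sketch_estimate
SELECT theta_sketch_estimate(theta_sketch_agg(user_id)) AS approx_users
FROM events;

-- Hypothetical set operation between two stored sketches, e.g. the
-- approximate overlap of two days' audiences (one sketch row per table)
SELECT theta_sketch_estimate(theta_intersection(a.sk, b.sk)) AS approx_overlap
FROM day1_sketch a CROSS JOIN day2_sketch b;
```

Intersections and set difference are exactly what Theta adds over HLL, so
a design doc would mainly need to settle naming and the sketch binary format.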
