Hi Menelaos,

Thanks for pointing that out. HLL sketches can be merged (union), but they do not support other set operations such as intersection or difference, which Theta sketches do. Tuple sketches additionally allow aggregating values associated with the same key. For those reasons, I don’t believe HLL alone is enough.

Chris

From: Menelaos Karavelas <menelaos.karave...@gmail.com>
Date: Tuesday, June 3, 2025 at 6:15 PM
To: "Boumalhab, Chris" <cboum...@amazon.com.INVALID>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: RE: [EXTERNAL] [DISCUSS] Proposal to Add Theta and Tuple Sketches to Spark SQL

Hello Chris.

HLL sketches from the same project (Apache DataSketches) have already been integrated into Spark. How does your proposal fit given that?

- Menelaos

On Jun 3, 2025, at 2:52 PM, Boumalhab, Chris <cboum...@amazon.com.INVALID> wrote:

Hi all,

I’d like to start a discussion about adding support for [Apache DataSketches](https://datasketches.apache.org/), specifically Theta and Tuple Sketches, to the Spark SQL and DataFrame APIs.

## Motivation

These sketches enable scalable approximate distinct counting and set operations (union, intersection, difference), and are well suited to large-scale analytics. They are already used in production in systems such as Druid and Presto. Integrating them natively into Spark (e.g., as UDAFs or SQL functions) could offer performance and memory-efficiency benefits for use cases such as:

- Large-cardinality distinct counts
- Approximate aggregations over streaming and batch data
- Set-based operations across datasets

## Proposed Scope

- Add Theta and Tuple Sketch-based UDAFs to Spark SQL
- Optionally integrate them into `spark.sql` functions (e.g., `approx_count_distinct_sketch`)
- Use Apache DataSketches as a dependency (already a top-level Apache project)
- Start as an optional module if core integration is too heavy

I’m happy to work on a design doc or POC if there’s interest.

Thanks,
Chris