Yes, HLL sketches do not support the operations you mention, and this is actually a good reason to add other types of sketches.
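To make the difference concrete, here is a toy, self-contained KMV-style sketch in Python. It is purely illustrative (the class name, parameters, and estimator are my own simplification, not the Apache DataSketches implementation), but it shows why Theta-style sketches compose under union and intersection, operations HLL does not support directly:

```python
import hashlib

def _hash01(x):
    # Map a value to a pseudo-uniform float in [0, 1).
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class ToyThetaSketch:
    """Illustrative KMV-style theta sketch: retains hashes below theta."""

    def __init__(self, k=256, entries=None, theta=1.0):
        self.k = k
        self.theta = theta
        self.entries = set(entries or ())
        self._trim()

    def _trim(self):
        if len(self.entries) > self.k:
            kept = sorted(self.entries)[: self.k]
            # The k-th smallest hash becomes the new threshold theta;
            # everything at or above it is dropped.
            self.theta = kept[-1]
            self.entries = {h for h in kept if h < self.theta}

    def update(self, value):
        h = _hash01(value)
        if h < self.theta:
            self.entries.add(h)
            self._trim()

    def estimate(self):
        # Each retained hash stands for roughly 1/theta distinct items.
        return len(self.entries) / self.theta

    def union(self, other):
        theta = min(self.theta, other.theta)
        ents = {h for h in self.entries | other.entries if h < theta}
        return ToyThetaSketch(self.k, ents, theta)

    def intersect(self, other):
        # The key capability HLL lacks: intersect two sketches directly.
        theta = min(self.theta, other.theta)
        ents = {h for h in self.entries & other.entries if h < theta}
        return ToyThetaSketch(self.k, ents, theta)
```

As in the real library, the relative error scales roughly with 1/sqrt(k), and intersections of sketches with small overlap carry noticeably higher error than unions.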
Ryan beat me to answering :) DataSketches is already a dependency, so it should make some things easier. Regarding the user-facing functionality, could you please be more specific as to what you propose? There is already "approx_count_distinct", and I am afraid that, for example, "approx_count_distinct_sketch" might be misleading or confusing (which sketch?).

- Menelaos

> On Jun 3, 2025, at 3:33 PM, Boumalhab, Chris <cboum...@amazon.com> wrote:
>
> Hi Menelaos,
>
> Thanks for pointing that out. HLL sketches do not support set operations
> such as intersection or difference. Tuple sketches would also allow value
> aggregation for the same key. For those reasons, I don’t believe HLL is
> enough.
>
> Chris
>
> From: Menelaos Karavelas <menelaos.karave...@gmail.com>
> Date: Tuesday, June 3, 2025 at 6:15 PM
> To: "Boumalhab, Chris" <cboum...@amazon.com.INVALID>
> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> Subject: RE: [EXTERNAL] [DISCUSS] Proposal to Add Theta and Tuple Sketches
> to Spark SQL
>
> Hello Chris.
>
> HLL sketches from the same project (Apache DataSketches) have already been
> integrated in Spark.
>
> How does your proposal fit given what I just mentioned?
>
> - Menelaos
>
> On Jun 3, 2025, at 2:52 PM, Boumalhab, Chris <cboum...@amazon.com.INVALID>
> wrote:
>
> Hi all,
>
> I’d like to start a discussion about adding support for
> [Apache DataSketches](https://datasketches.apache.org/) — specifically,
> Theta and Tuple Sketches — to Spark SQL and DataFrame APIs.
>
> ## Motivation
> These sketches support scalable approximate set operations (distinct
> count, union, intersection, and set difference) and are well suited for
> large-scale analytics. They are already used in production in systems such
> as Druid and Presto.
>
> Integrating them natively into Spark (e.g., as UDAFs or SQL functions)
> could offer performance and memory-efficiency benefits for use cases such
> as:
> - Large-cardinality distinct counts
> - Approximate aggregations over streaming/batch data
> - Set-based operations across datasets
>
> ## Proposed Scope
> - Add Theta and Tuple Sketch-based UDAFs to Spark SQL
> - Optional integration into `spark.sql` functions (e.g.,
>   `approx_count_distinct_sketch`)
> - Use Apache DataSketches as a dependency (already a top-level Apache
>   project)
> - Start as an optional module if core integration is too heavy
>
> I’m happy to work on a design doc or POC if there’s interest.
>
> Thanks,
> Chris
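To illustrate the value-aggregation point raised earlier in the thread, here is a toy tuple-sketch-like structure in Python. The class and method names are hypothetical simplifications, not the DataSketches API; it only shows how such a sketch can retain a per-key summary (here a running sum) alongside the sampled key hashes, which a pure distinct-count sketch like HLL cannot do:

```python
import hashlib

def _hash01(x):
    # Map a key to a pseudo-uniform float in [0, 1).
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class ToyTupleSketch:
    """Illustrative tuple sketch: each retained key carries a summed value."""

    def __init__(self, k=256):
        self.k = k
        self.theta = 1.0
        self.entries = {}  # key hash -> aggregated value for that key

    def update(self, key, value):
        h = _hash01(key)
        if h >= self.theta:
            return  # key falls outside the current sampling threshold
        # Aggregate values for the same key into one summary.
        self.entries[h] = self.entries.get(h, 0) + value
        if len(self.entries) > self.k:
            kept = sorted(self.entries)[: self.k]
            self.theta = kept[-1]  # new cutoff; drop hashes at or above it
            self.entries = {h2: v for h2, v in self.entries.items()
                            if h2 < self.theta}

    def estimate_distinct_keys(self):
        # Each retained hash stands for roughly 1/theta distinct keys.
        return len(self.entries) / self.theta

    def estimate_total_value(self):
        # Retained sums are a uniform sample of keys; scale up by 1/theta.
        return sum(self.entries.values()) / self.theta
```

In a Spark UDAF along the lines the proposal describes, `update` would run per row within a partition and the per-partition sketches would then be merged in the aggregation buffer, which is what makes these sketches attractive for distributed execution.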