[ 
https://issues.apache.org/jira/browse/SPARK-55939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-55939:
-----------------------------
    Description: 
## Description


### Motivation


Spark SQL already provides built-in support for several Apache DataSketches 
algorithms:
- HyperLogLog (HLL) sketches for approximate count distinct
- Theta sketches for count distinct with set operations
- Tuple sketches for count distinct with aggregated summaries
- KLL sketches for approximate quantiles
- Top-K sketches for approximate top-K items


This PR adds built-in support for the **DataSketches ItemsSketch** (Frequent 
Items), which tracks the approximate frequency of items in a data stream. 
Unlike `approx_top_k` which only returns the top-K items, ItemsSketch provides:
- Frequency estimates for any item (not just the top-K)
- Configurable error guarantees (`NO_FALSE_POSITIVES` vs `NO_FALSE_NEGATIVES`)
- Mergeable binary representations for multi-level rollup aggregation
- Support for a wide range of data types (boolean, numeric, string, decimal, 
date, timestamp)


### Use Cases


- **Finding top sellers across time horizons**: Build daily per-category 
sketches, then merge across days to find weekly/monthly top sellers without 
re-scanning raw data.
- **Frequency estimation in streaming**: Maintain running frequency sketches 
that can be merged across micro-batches.
- **Multi-dimensional rollup**: Build sketches at child-level dimensions and 
merge up to parent dimensions.


### Functions Added (6 total)


**Aggregate functions:**
1. `items_sketch_agg(expr [, maxMapSize])` — Builds an ItemsSketch from input 
values. Returns binary.
2. `items_sketch_merge_agg(sketch [, maxMapSize])` — Merges pre-built 
ItemsSketch binaries into one. Returns binary.


**Scalar functions:**
3. `items_sketch_get_frequent_items(sketch, errorType)` — Returns frequent 
items as an array of structs `\{item, estimate, lowerBound, upperBound}`. 
`errorType` must be `'NO_FALSE_POSITIVES'` or `'NO_FALSE_NEGATIVES'`.
4. `items_sketch_get_estimate(sketch, item)` — Returns the estimated frequency 
of a specific item.
5. `items_sketch_merge(first, second)` — Merges two ItemsSketch binaries 
(scalar).
6. `items_sketch_to_string(sketch)` — Returns a human-readable summary string.


### Parameters


- `maxMapSize`: Controls sketch accuracy. Must be a power of 2, range [8, 
67108864]. Default: 4096. Larger values use more memory but provide more 
accurate frequency estimates. The effective capacity is `0.75 * maxMapSize`.


### Supported Data Types


Boolean, Byte, Short, Int, Long, Float, Double, Date, Timestamp, TimestampNTZ, 
String, Decimal.


### Wire Format


The binary output embeds the data type DDL alongside the sketch bytes:
```
[ddlLength (4 bytes)][ddlBytes (n bytes)][sketchBytes (remaining)]
```
This enables downstream scalar functions to deserialize without requiring 
explicit type information.

 

> Add built-in DataSketches ItemsSketch (Frequent Items) functions to Spark SQL
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-55939
>                 URL: https://issues.apache.org/jira/browse/SPARK-55939
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Bo Xiong
>            Assignee: Bo Xiong
>            Priority: Major
>              Labels: sketch,most-frequent,top-k
>             Fix For: 4.2.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> ## Description
> ### Motivation
> Spark SQL already provides built-in support for several Apache DataSketches 
> algorithms:
> - HyperLogLog (HLL) sketches for approximate count distinct
> - Theta sketches for count distinct with set operations
> - Tuple sketches for count distinct with aggregated summaries
> - KLL sketches for approximate quantiles
> - Top-K sketches for approximate top-K items
> This PR adds built-in support for the **DataSketches ItemsSketch** (Frequent 
> Items), which tracks the approximate frequency of items in a data stream. 
> Unlike `approx_top_k` which only returns the top-K items, ItemsSketch 
> provides:
> - Frequency estimates for any item (not just the top-K)
> - Configurable error guarantees (`NO_FALSE_POSITIVES` vs `NO_FALSE_NEGATIVES`)
> - Mergeable binary representations for multi-level rollup aggregation
> - Support for a wide range of data types (boolean, numeric, string, decimal, 
> date, timestamp)
> ### Use Cases
> - **Finding top sellers across time horizons**: Build daily per-category 
> sketches, then merge across days to find weekly/monthly top sellers without 
> re-scanning raw data.
> - **Frequency estimation in streaming**: Maintain running frequency sketches 
> that can be merged across micro-batches.
> - **Multi-dimensional rollup**: Build sketches at child-level dimensions and 
> merge up to parent dimensions.
> ### Functions Added (6 total)
> **Aggregate functions:**
> 1. `items_sketch_agg(expr [, maxMapSize])` — Builds an ItemsSketch from input 
> values. Returns binary.
> 2. `items_sketch_merge_agg(sketch [, maxMapSize])` — Merges pre-built 
> ItemsSketch binaries into one. Returns binary.
> **Scalar functions:**
> 3. `items_sketch_get_frequent_items(sketch, errorType)` — Returns frequent 
> items as an array of structs `\{item, estimate, lowerBound, upperBound}`. 
> `errorType` must be `'NO_FALSE_POSITIVES'` or `'NO_FALSE_NEGATIVES'`.
> 4. `items_sketch_get_estimate(sketch, item)` — Returns the estimated 
> frequency of a specific item.
> 5. `items_sketch_merge(first, second)` — Merges two ItemsSketch binaries 
> (scalar).
> 6. `items_sketch_to_string(sketch)` — Returns a human-readable summary string.
> ### Parameters
> - `maxMapSize`: Controls sketch accuracy. Must be a power of 2, range [8, 
> 67108864]. Default: 4096. Larger values use more memory but provide more 
> accurate frequency estimates. The effective capacity is `0.75 * maxMapSize`.
> ### Supported Data Types
> Boolean, Byte, Short, Int, Long, Float, Double, Date, Timestamp, 
> TimestampNTZ, String, Decimal.
> ### Wire Format
> The binary output embeds the data type DDL alongside the sketch bytes:
> ```
> [ddlLength (4 bytes)][ddlBytes (n bytes)][sketchBytes (remaining)]
> ```
> This enables downstream scalar functions to deserialize without requiring 
> explicit type information.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to