[
https://issues.apache.org/jira/browse/SPARK-55939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bo Xiong updated SPARK-55939:
-----------------------------
Description:
## Description
### Motivation
Spark SQL already provides built-in support for several Apache DataSketches
algorithms:
- HyperLogLog (HLL) sketches for approximate count distinct
- Theta sketches for count distinct with set operations
- Tuple sketches for count distinct with aggregated summaries
- KLL sketches for approximate quantiles
- Top-K sketches for approximate top-K items
This PR adds built-in support for the **DataSketches ItemsSketch** (Frequent
Items), which tracks the approximate frequency of items in a data stream.
Unlike `approx_top_k`, which returns only the top-K items, ItemsSketch provides:
- Frequency estimates for any item (not just the top-K)
- Configurable error guarantees (`NO_FALSE_POSITIVES` vs `NO_FALSE_NEGATIVES`)
- Mergeable binary representations for multi-level rollup aggregation
- Support for a wide range of data types (boolean, numeric, string, decimal,
date, timestamp)
### Use Cases
- **Finding top sellers across time horizons**: Build daily per-category
sketches, then merge across days to find weekly/monthly top sellers without
re-scanning raw data.
- **Frequency estimation in streaming**: Maintain running frequency sketches
that can be merged across micro-batches.
- **Multi-dimensional rollup**: Build sketches at child-level dimensions and
merge up to parent dimensions.
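The rollup use case could look like the following sketch, assuming the function signatures proposed below and a hypothetical `daily_sales` table with `sale_date`, `category`, and `product_id` columns:

```sql
-- Build one sketch per category per day (hypothetical daily_sales table).
CREATE TABLE daily_sketches AS
SELECT sale_date, category,
       items_sketch_agg(product_id) AS sk
FROM daily_sales
GROUP BY sale_date, category;

-- Merge a week of daily sketches and extract frequent items,
-- without re-scanning the raw sales data.
SELECT category,
       items_sketch_get_frequent_items(
         items_sketch_merge_agg(sk),
         'NO_FALSE_POSITIVES') AS top_sellers
FROM daily_sketches
WHERE sale_date BETWEEN DATE'2025-01-06' AND DATE'2025-01-12'
GROUP BY category;
```

The same pattern extends to monthly or yearly horizons by widening the date filter over the stored daily sketches.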
### Functions Added (6 total)
**Aggregate functions:**
1. `items_sketch_agg(expr [, maxMapSize])` — Builds an ItemsSketch from input
values. Returns binary.
2. `items_sketch_merge_agg(sketch [, maxMapSize])` — Merges pre-built
ItemsSketch binaries into one. Returns binary.
**Scalar functions:**
3. `items_sketch_get_frequent_items(sketch, errorType)` — Returns frequent
items as an array of structs `{item, estimate, lowerBound, upperBound}`.
`errorType` must be `'NO_FALSE_POSITIVES'` or `'NO_FALSE_NEGATIVES'`.
4. `items_sketch_get_estimate(sketch, item)` — Returns the estimated frequency
of a specific item.
5. `items_sketch_merge(first, second)` — Merges two ItemsSketch binaries
(scalar).
6. `items_sketch_to_string(sketch)` — Returns a human-readable summary string.
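Taken together, a minimal end-to-end flow (a sketch, assuming these signatures land as described, over a hypothetical `events` table with an `item` column) could be:

```sql
-- Build a sketch over a column of items; the result is BINARY.
CREATE TABLE sketches AS
SELECT items_sketch_agg(item) AS sk FROM events;

-- Human-readable summary of the sketch's state.
SELECT items_sketch_to_string(sk) FROM sketches;

-- Frequent items with NO_FALSE_POSITIVES: every returned item is
-- guaranteed to be truly frequent, though some frequent items may be missed.
-- NO_FALSE_NEGATIVES makes the opposite trade-off.
SELECT items_sketch_get_frequent_items(sk, 'NO_FALSE_POSITIVES') FROM sketches;

-- Point estimate of the frequency of a single (hypothetical) item.
SELECT items_sketch_get_estimate(sk, 'item_42') FROM sketches;
```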
### Parameters
- `maxMapSize`: Controls sketch accuracy. Must be a power of 2 in the range
[8, 67108864]. Default: 4096. Larger values use more memory but yield more
accurate frequency estimates. The effective capacity (the number of counters
the sketch maintains before it starts approximating) is `0.75 * maxMapSize`.
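For sizing intuition (an illustrative sketch, not authoritative guidance): the default `maxMapSize = 4096` gives an effective capacity of `0.75 * 4096 = 3072` counters, so data with many distinct heavy items may warrant the next power of two:

```sql
-- Hypothetical example: double the map size for tighter error bounds,
-- at the cost of roughly double the per-sketch memory.
SELECT category,
       items_sketch_agg(product_id, 8192) AS sk
FROM daily_sales
GROUP BY category;
```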
> Add built-in DataSketches ItemsSketch (Frequent Items) functions to Spark SQL
> -----------------------------------------------------------------------------
>
> Key: SPARK-55939
> URL: https://issues.apache.org/jira/browse/SPARK-55939
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Bo Xiong
> Assignee: Bo Xiong
> Priority: Major
> Labels: sketch,most-frequent,top-k
> Fix For: 4.2.0
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]