[
https://issues.apache.org/jira/browse/SPARK-55939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bo Xiong updated SPARK-55939:
-----------------------------
Description:
h3. Motivation
Spark SQL already provides built-in support for several Apache DataSketches
algorithms:
* HyperLogLog (HLL) sketches for approximate count distinct
* Theta sketches for count distinct with set operations
* Tuple sketches for count distinct with aggregated summaries
* KLL sketches for approximate quantiles
* Top-K sketches for approximate top-K items
This PR adds built-in support for the *DataSketches ItemsSketch* (Frequent
Items), which tracks the approximate frequency of items in a data stream.
Unlike {{approx_top_k}} which only returns the top-K items, ItemsSketch
provides:
* Frequency estimates for any item (not just the top-K)
* Configurable error guarantees ({{{}NO_FALSE_POSITIVES{}}} vs
{{{}NO_FALSE_NEGATIVES{}}})
* Mergeable binary representations for multi-level rollup aggregation
* Support for a wide range of data types (boolean, numeric, string, decimal,
date, timestamp)
h3. Use Cases
* {*}Finding top sellers across time horizons{*}: Build daily per-category
sketches, then merge across days to find weekly/monthly top sellers without
re-scanning raw data.
* {*}Frequency estimation in streaming{*}: Maintain running frequency sketches
that can be merged across micro-batches.
* {*}Multi-dimensional rollup{*}: Build sketches at child-level dimensions and
merge up to parent dimensions.
was:
## Description
### Motivation
Spark SQL already provides built-in support for several Apache DataSketches
algorithms:
- HyperLogLog (HLL) sketches for approximate count distinct
- Theta sketches for count distinct with set operations
- Tuple sketches for count distinct with aggregated summaries
- KLL sketches for approximate quantiles
- Top-K sketches for approximate top-K items
This PR adds built-in support for the **DataSketches ItemsSketch** (Frequent
Items), which tracks the approximate frequency of items in a data stream.
Unlike `approx_top_k` which only returns the top-K items, ItemsSketch provides:
- Frequency estimates for any item (not just the top-K)
- Configurable error guarantees (`NO_FALSE_POSITIVES` vs `NO_FALSE_NEGATIVES`)
- Mergeable binary representations for multi-level rollup aggregation
- Support for a wide range of data types (boolean, numeric, string, decimal,
date, timestamp)
### Use Cases
- **Finding top sellers across time horizons**: Build daily per-category
sketches, then merge across days to find weekly/monthly top sellers without
re-scanning raw data.
- **Frequency estimation in streaming**: Maintain running frequency sketches
that can be merged across micro-batches.
- **Multi-dimensional rollup**: Build sketches at child-level dimensions and
merge up to parent dimensions.
### Functions Added (6 total)
**Aggregate functions:**
1. `items_sketch_agg(expr [, maxMapSize])` — Builds an ItemsSketch from input
values. Returns binary.
2. `items_sketch_merge_agg(sketch [, maxMapSize])` — Merges pre-built
ItemsSketch binaries into one. Returns binary.
**Scalar functions:**
3. `items_sketch_get_frequent_items(sketch, errorType)` — Returns frequent
items as an array of structs `{item, estimate, lowerBound, upperBound}`.
`errorType` must be `'NO_FALSE_POSITIVES'` or `'NO_FALSE_NEGATIVES'`.
4. `items_sketch_get_estimate(sketch, item)` — Returns the estimated frequency
of a specific item.
5. `items_sketch_merge(first, second)` — Merges two ItemsSketch binaries
(scalar).
6. `items_sketch_to_string(sketch)` — Returns a human-readable summary string.
### Parameters
- `maxMapSize`: Controls sketch accuracy. Must be a power of 2, range [8,
67108864]. Default: 4096. Larger values use more memory but provide more
accurate frequency estimates. The effective capacity is `0.75 * maxMapSize`.
> Add built-in DataSketches ItemsSketch (Frequent Items) functions to Spark SQL
> -----------------------------------------------------------------------------
>
> Key: SPARK-55939
> URL: https://issues.apache.org/jira/browse/SPARK-55939
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Bo Xiong
> Assignee: Bo Xiong
> Priority: Major
> Labels: sketch,most-frequent,top-k
> Fix For: 4.2.0
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> h3. Motivation
> Spark SQL already provides built-in support for several Apache DataSketches
> algorithms:
> * HyperLogLog (HLL) sketches for approximate count distinct
> * Theta sketches for count distinct with set operations
> * Tuple sketches for count distinct with aggregated summaries
> * KLL sketches for approximate quantiles
> * Top-K sketches for approximate top-K items
> This PR adds built-in support for the *DataSketches ItemsSketch* (Frequent
> Items), which tracks the approximate frequency of items in a data stream.
> Unlike {{approx_top_k}} which only returns the top-K items, ItemsSketch
> provides:
> * Frequency estimates for any item (not just the top-K)
> * Configurable error guarantees ({{{}NO_FALSE_POSITIVES{}}} vs
> {{{}NO_FALSE_NEGATIVES{}}})
> * Mergeable binary representations for multi-level rollup aggregation
> * Support for a wide range of data types (boolean, numeric, string, decimal,
> date, timestamp)
> h3. Use Cases
> * {*}Finding top sellers across time horizons{*}: Build daily per-category
> sketches, then merge across days to find weekly/monthly top sellers without
> re-scanning raw data.
> * {*}Frequency estimation in streaming{*}: Maintain running frequency
> sketches that can be merged across micro-batches.
> * {*}Multi-dimensional rollup{*}: Build sketches at child-level dimensions
> and merge up to parent dimensions.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]