[I] feat: Add `approx_top_k` aggregate function for approximate heavy hitter / top-k frequency estimation [datafusion]

via GitHub Mon, 16 Mar 2026 10:12:41 -0700


sesteves opened a new issue, #20967:
URL: https://github.com/apache/datafusion/issues/20967

## Is your feature request related to a problem or challenge?

DataFusion currently lacks a built-in approximate top-k / heavy hitter
aggregate function. Users who need to find the most frequently occurring values
in a column must resort to `GROUP BY value ORDER BY COUNT(*) DESC LIMIT k`,
which requires materializing all distinct groups, sorting, and truncating — a
process that is both memory-intensive and slow on high-cardinality columns.

This is a common analytical query pattern with well-known streaming
approximation algorithms that provide bounded-error results in constant memory.
Other query engines already support this:

- **ClickHouse**: [`topK` /
`topKWeighted`](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/topk)
— uses the Filtered Space-Saving algorithm
- **PostgreSQL**: Available via extensions (e.g., `datasketches`)
- **Druid**: Built-in approximate top-k via Data Sketches

## Describe the solution you'd like

Add an `approx_top_k(expression, k)` aggregate function that returns an
approximate list of the top `k` most frequent values along with their estimated
counts, using a streaming algorithm that operates in bounded memory.

### Algorithm: Filtered Space-Saving

The implementation should use the **Space-Saving** algorithm (Metwally et
al., 2005), which maintains a fixed-size summary of counters. When a new item
arrives and the summary is full, the item with the minimum count is evicted and
replaced. This guarantees that all items with frequency above `N/capacity` will
be tracked.

The accuracy is further improved with an **alpha map** (as described in the
Filtered Space-Saving variant), which tracks recently evicted items and filters
out low-frequency noise before it enters the main summary. This is the same
approach used in **ClickHouse's `topK` implementation**.

Key design points:

- **`CAPACITY_MULTIPLIER = 3`**: The internal summary tracks `k * 3`
counters (matching ClickHouse's default). Over-provisioning relative to `k`
significantly improves accuracy for skewed distributions.
- **Mergeable state**: Summaries from different partitions can be merged,
enabling parallel / distributed execution.
- **Serializable**: The summary can be serialized to/from bytes for
intermediate state transfer (e.g., proto roundtrip).
- **Type support**: Works with any hashable scalar type (strings, integers,
floats, etc.).

### SQL Syntax

```sql
-- Returns a list of structs {value, count} for the top k most frequent
values
SELECT approx_top_k(column_name, 5) FROM table;

-- With GROUP BY
SELECT group_col, approx_top_k(value_col, 3) FROM table GROUP BY group_col;

-- k defaults to 5 if omitted
SELECT approx_top_k(column_name) FROM table;
```

### Return Type

Returns a `List(Struct({value: T, count: UInt64}))` ordered by count
descending, where `T` matches the input column type.

## Describe alternatives you've considered

1. **Exact `GROUP BY ... ORDER BY COUNT(*) DESC LIMIT k`** — works but
requires full materialization of all groups, which is prohibitive on
high-cardinality columns (millions of distinct values).
2. **External sketching libraries** — users can implement this as a UDAF,
but having it built-in lowers the barrier and ensures it's well-tested and
optimized within DataFusion's execution model.
3. **Count-Min Sketch + Heap** — another approximate approach, but
Space-Saving provides deterministic error bounds and is simpler to implement
correctly for top-k specifically.

## Additional context

- The Space-Saving algorithm provides the guarantee that any item with true
frequency ≥ `N/capacity` will appear in the summary.
- The Filtered Space-Saving extension (with alpha map) reduces over-counting
of infrequent items, improving result quality in practice.
- This is particularly useful for observability (top error codes, top URLs),
analytics (most popular products, most active users), and data profiling
(frequent value detection).
- ClickHouse has validated this algorithm at scale in production for years.

### References

- Metwally, A., Agrawal, D., & El Abbadi, A. (2005). *Efficient Computation
of Frequent and Top-k Elements in Data Streams.*
- [ClickHouse `topK`
documentation](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/topk)
- [Filtered Space-Saving
description](https://doi.org/10.1016/j.ins.2015.11.002)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] feat: Add `approx_top_k` aggregate function for approximate heavy hitter / top-k frequency estimation [datafusion]

Reply via email to