sesteves opened a new issue, #20967: URL: https://github.com/apache/datafusion/issues/20967
## Is your feature request related to a problem or challenge? DataFusion currently lacks a built-in approximate top-k / heavy hitter aggregate function. Users who need to find the most frequently occurring values in a column must resort to `GROUP BY value ORDER BY COUNT(*) DESC LIMIT k`, which requires materializing all distinct groups, sorting, and truncating — a process that is both memory-intensive and slow on high-cardinality columns. This is a common analytical query pattern with well-known streaming approximation algorithms that provide bounded-error results in constant memory. Other query engines already support this: - **ClickHouse**: [`topK` / `topKWeighted`](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/topk) — uses the Filtered Space-Saving algorithm - **PostgreSQL**: Available via extensions (e.g., `datasketches`) - **Druid**: Built-in approximate top-k via Data Sketches ## Describe the solution you'd like Add an `approx_top_k(expression, k)` aggregate function that returns an approximate list of the top `k` most frequent values along with their estimated counts, using a streaming algorithm that operates in bounded memory. ### Algorithm: Filtered Space-Saving The implementation should use the **Space-Saving** algorithm (Metwally et al., 2005), which maintains a fixed-size summary of counters. When a new item arrives and the summary is full, the item with the minimum count is evicted and replaced. This guarantees that all items with frequency above `N/capacity` will be tracked. The accuracy is further improved with an **alpha map** (as described in the Filtered Space-Saving variant), which tracks recently evicted items and filters out low-frequency noise before it enters the main summary. This is the same approach used in **ClickHouse's `topK` implementation**. Key design points: - **`CAPACITY_MULTIPLIER = 3`**: The internal summary tracks `k * 3` counters (matching ClickHouse's default). Over-provisioning relative to `k` significantly improves accuracy for skewed distributions. - **Mergeable state**: Summaries from different partitions can be merged, enabling parallel / distributed execution. - **Serializable**: The summary can be serialized to/from bytes for intermediate state transfer (e.g., proto roundtrip). - **Type support**: Works with any hashable scalar type (strings, integers, floats, etc.). ### SQL Syntax ```sql -- Returns a list of structs {value, count} for the top k most frequent values SELECT approx_top_k(column_name, 5) FROM table; -- With GROUP BY SELECT group_col, approx_top_k(value_col, 3) FROM table GROUP BY group_col; -- k defaults to 5 if omitted SELECT approx_top_k(column_name) FROM table; ``` ### Return Type Returns a `List(Struct({value: T, count: UInt64}))` ordered by count descending, where `T` matches the input column type. ## Describe alternatives you've considered 1. **Exact `GROUP BY ... ORDER BY COUNT(*) DESC LIMIT k`** — works but requires full materialization of all groups, which is prohibitive on high-cardinality columns (millions of distinct values). 2. **External sketching libraries** — users can implement this as a UDAF, but having it built-in lowers the barrier and ensures it's well-tested and optimized within DataFusion's execution model. 3. **Count-Min Sketch + Heap** — another approximate approach, but Space-Saving provides deterministic error bounds and is simpler to implement correctly for top-k specifically. ## Additional context - The Space-Saving algorithm provides the guarantee that any item with true frequency ≥ `N/capacity` will appear in the summary. - The Filtered Space-Saving extension (with alpha map) reduces over-counting of infrequent items, improving result quality in practice. - This is particularly useful for observability (top error codes, top URLs), analytics (most popular products, most active users), and data profiling (frequent value detection). - ClickHouse has validated this algorithm at scale in production for years. ### References - Metwally, A., Agrawal, D., & El Abbadi, A. (2005). *Efficient Computation of Frequent and Top-k Elements in Data Streams.* - [ClickHouse `topK` documentation](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/topk) - [Filtered Space-Saving description](https://doi.org/10.1016/j.ins.2015.11.002) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
