Durgaprasad M L created SPARK-57319:
---------------------------------------
Summary: Rename misleading approx_top_k terminology to approx_frequent_items
Key: SPARK-57319
URL: https://issues.apache.org/jira/browse/SPARK-57319
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.1.1
Reporter: Durgaprasad M L
The current approx_top_k naming in Spark is misleading because the underlying
implementation is based on Apache DataSketches Frequent Items sketches, which
do not provide strict top-k guarantees.
Instead, the sketch identifies frequent items (heavy hitters) using
threshold-oriented probabilistic guarantees. It may legitimately return fewer
than k items, or no items at all, depending on the stream distribution and the
sketch configuration.
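To illustrate the difference, here is a minimal sketch in plain Python. It is a simplified Misra-Gries summary, not Spark's or DataSketches' actual implementation. The names misra_gries, frequent_items, and capacity are hypothetical, and the "report only items whose count exceeds the maximum error" rule is an assumption chosen to mimic a no-false-positives style of reporting.

```python
def misra_gries(stream, capacity):
    """Simplified Misra-Gries summary holding at most `capacity` counters.

    Returns (counters, decrements). Each counter is a lower bound on the
    item's true frequency; the true frequency is at most counter + decrements.
    """
    counters = {}
    decrements = 0
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # Summary is full: decrement every counter and drop the zeros.
            # The incoming item itself is discarded.
            decrements += 1
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, decrements


def frequent_items(stream, capacity, k):
    """Return up to k items whose lower-bound count exceeds the max error.

    Assumption for illustration: an item is reported only when its counter
    is greater than the number of decrement rounds.
    """
    counters, max_error = misra_gries(stream, capacity)
    hits = [(item, count) for item, count in counters.items() if count > max_error]
    hits.sort(key=lambda pair: -pair[1])
    return hits[:k]


# Skewed stream: two heavy hitters plus 20 items that each occur once.
skewed = ["a"] * 50 + ["b"] * 30 + [str(i) for i in range(20)]
print(frequent_items(skewed, capacity=4, k=3))   # [('a', 44), ('b', 24)]

# Uniform stream: 100 distinct items, so nothing clears the threshold.
uniform = list(range(100))
print(frequent_items(uniform, capacity=4, k=3))  # []
```

With k=3 the skewed stream yields only two items and the uniform stream yields none, which is expected for a frequent-items summary but surprising under a name that promises a top-k result.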
This improvement proposes:
- renaming approx_top_k terminology to approx_frequent_items
- aligning terminology with Apache DataSketches documentation
- preserving backward compatibility through deprecated aliases
- updating Scala, PySpark, Spark Connect APIs, docs, and test suites
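As a rough sketch of what "deprecated aliases" could look like on the Python side, the following plain-Python example keeps the old name callable while emitting a DeprecationWarning. It is not taken from the related PR. The helper deprecated_alias is hypothetical, and approx_frequent_items here is a stand-in that counts exactly rather than calling Spark.

```python
import functools
import warnings
from collections import Counter


def deprecated_alias(new_func, old_name):
    """Build a wrapper that forwards to new_func and warns about old_name."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


def approx_frequent_items(values, k):
    # Stand-in body for illustration only: exact counts, not a sketch.
    return Counter(values).most_common(k)


# The old name keeps working but warns on every call.
approx_top_k = deprecated_alias(approx_frequent_items, "approx_top_k")
```

Existing callers of the old name would keep getting the same results while seeing a warning that points at the new name, which is the usual shape of a non-breaking rename.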
Related PR:
https://github.com/apache/spark/pull/56333