durgaprasadml opened a new pull request, #56333: URL: https://github.com/apache/spark/pull/56333
## Summary

Rename approx_top_k terminology to approx_frequent_items across the Spark codebase to better reflect the semantics of the underlying Apache DataSketches Frequent Items sketch implementation.

The current naming is misleading because the sketch does not provide strict top-k guarantees. Instead, it identifies frequent items / heavy hitters based on threshold-oriented probabilistic guarantees and may legitimately return fewer than k items (or even none).

This change aligns Spark terminology more closely with the Apache DataSketches documentation and expected sketch behavior.

## Changes

### Core implementation

- Renamed Catalyst aggregate and expression implementations
- Renamed aggregate registration references
- Updated SQL function registration entries
- Updated related error definitions and references

### APIs

- Updated Scala SQL APIs
- Updated PySpark APIs
- Updated Spark Connect function definitions

### Documentation

- Updated sql-ref-sketch-aggregates.md
- Replaced misleading "top k" terminology with "frequent items" terminology
- Clarified sketch semantics and behavior

### Tests

- Renamed and updated related test suites
- Regenerated SQL expression schema golden files
- Fixed test expectations referencing old SQL names
- Verified FrequentItemsSuite passes successfully

## Motivation

The underlying implementation uses Apache DataSketches Frequent Items sketches, which:

- identify heavy hitters / frequent items
- provide threshold-based probabilistic guarantees
- do not guarantee ranked top-k semantics

For example, depending on stream distribution and sketch size, the sketch may validly return:

- fewer than k items
- or no items at all

The previous naming created incorrect expectations for users and did not match the terminology used by Apache DataSketches itself.
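To illustrate why a frequent-items summary can return fewer than k items, here is a minimal Python sketch of the classic Misra-Gries counter scheme. It is a simplified stand-in chosen for illustration only: it is not the Apache DataSketches implementation and not the Spark code touched by this PR. The names `misra_gries` and `report_up_to_k` and both sample streams are hypothetical.

```python
def misra_gries(stream, num_counters):
    """Summarize a stream using at most `num_counters` counters.

    Illustrative Misra-Gries summary (an assumption for this example,
    not the DataSketches algorithm). Each surviving count is a lower
    bound on the item's true frequency.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # No free counter: decrement every tracked item and drop
            # those that reach zero. The incoming item is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def report_up_to_k(counters, k):
    """Return at most k (item, count) pairs, highest count first."""
    ranked = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Hypothetical streams: one dominated by a heavy hitter, one with no repeats.
skewed = ["a"] * 60 + [f"x{i}" for i in range(40)]
uniform = list(range(100))

skewed_result = report_up_to_k(misra_gries(skewed, 3), k=3)
uniform_result = report_up_to_k(misra_gries(uniform, 3), k=3)

print(skewed_result)   # fewer than 3 entries, led by the heavy hitter "a"
print(uniform_result)  # empty: no item is frequent enough to survive
```

With k = 3 requested, the skewed stream yields fewer than three entries and the uniform stream yields none, which is the behavior that makes "top k" a misleading name for this kind of summary.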
## Validation

Validated with:

- FrequentItemsSuite
- SQL expression schema regeneration
- compilation and API updates
- updated documentation and registry references

## Related Issue

Fixes #56331

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
