durgaprasadml opened a new pull request, #56333:
URL: https://github.com/apache/spark/pull/56333

   ## Summary
   
   Rename `approx_top_k` terminology to `approx_frequent_items` across the Spark 
codebase to better reflect the semantics of the underlying Apache DataSketches 
Frequent Items sketch implementation.
   
   The current naming is misleading because the sketch does not provide strict 
top-k guarantees. Instead, it identifies frequent items / heavy hitters based 
on threshold-oriented probabilistic guarantees and may legitimately return 
fewer than k items (or even none).
   
   This change aligns Spark terminology more closely with the Apache 
DataSketches documentation and expected sketch behavior.
   
   ## Changes
   
   ### Core implementation
   - Renamed Catalyst aggregate and expression implementations
   - Renamed aggregate registration references
   - Updated SQL function registration entries
   - Updated related error definitions and references
   
   ### APIs
   - Updated Scala SQL APIs
   - Updated PySpark APIs
   - Updated Spark Connect function definitions
   
   ### Documentation
   - Updated sql-ref-sketch-aggregates.md
   - Replaced misleading "top k" terminology with "frequent items" terminology
   - Clarified sketch semantics and behavior
   
   ### Tests
   - Renamed and updated related test suites
   - Regenerated SQL expression schema golden files
   - Fixed test expectations referencing old SQL names
   - Verified FrequentItemsSuite passes successfully
   
   ## Motivation
   
   The underlying implementation uses Apache DataSketches Frequent Items 
sketches, which:
   - identify heavy hitters / frequent items
   - provide threshold-based probabilistic guarantees
   - do not guarantee ranked top-k semantics
   
   For example, depending on stream distribution and sketch size, the sketch 
may validly return:
   - fewer than k items
   - or no items at all
   
   The previous naming created incorrect expectations for users and did not 
match the terminology used by Apache DataSketches itself.
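
To illustrate why a frequent-items summary can return fewer than k items, here is a minimal sketch using the classic Misra-Gries algorithm. This is a simplified stand-in for illustration only: it is not the Apache DataSketches implementation and not the code in this PR. The function name `frequent_items`, its parameters, and the reporting rule (keep only items whose stored count exceeds the maximum possible error) are assumptions made for this example.

```python
def frequent_items(stream, max_counters, k):
    """Illustrative Misra-Gries summary (hypothetical helper, not Spark's API).

    Tracks at most `max_counters` candidates and returns up to `k`
    (item, lower_bound_count) pairs that are confirmed frequent.
    """
    counters = {}
    decrements = 0  # number of "decrement all" rounds; bounds the undercount
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < max_counters:
            counters[item] = 1
        else:
            # No free slot: discard the new item, decrement every counter,
            # and drop counters that reach zero.
            decrements += 1
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # A stored count undercounts the true frequency by at most `decrements`,
    # and any untracked item occurred at most `decrements` times. Keep only
    # items whose lower bound clears that error threshold.
    confirmed = [(item, count) for item, count in counters.items()
                 if count > decrements]
    confirmed.sort(key=lambda pair: pair[1], reverse=True)
    return confirmed[:k]


# Skewed stream: only two heavy hitters are confirmed even though k = 3.
print(frequent_items(["a"] * 6 + ["b"] * 3 + ["c"], max_counters=2, k=3))
# -> [('a', 5), ('b', 2)]

# Uniform stream: no item clears the error threshold, so nothing is returned.
print(frequent_items(list(range(10)), max_counters=2, k=3))
# -> []
```

In both calls k is 3, yet the result has two items and zero items respectively, which is the behavior the "top k" name fails to convey.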
   
   ## Validation
   
   Validated with:
   - FrequentItemsSuite
   - SQL expression schema regeneration
   - compilation of the updated APIs
   - a check of the updated documentation and registry references
   
   ## Related Issue
   
   Fixes #56331


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

