Re: [I] Fix name for frequent items/heavy hitters sketch from highly misleading "approx top k" [spark]

via GitHub Thu, 04 Jun 2026 12:24:55 -0700


durgaprasadml commented on issue #56331:
URL: https://github.com/apache/spark/issues/56331#issuecomment-4625403478


   Thanks for raising this; the concern makes sense.
   
   The current approx_top_k name strongly implies ranking semantics: ask for k 
items and get back the k most frequent. The underlying implementation, however, 
is based on the Apache DataSketches Frequent Items / Heavy Hitters sketch 
family, whose guarantees are stated relative to a frequency error threshold 
rather than as a true top-k ranking.
   
   As demonstrated in the example above, the sketch can legitimately return:
   
   * fewer than k items
   * or even zero items
   
   depending on the stream distribution and configured sketch size, while still 
behaving correctly according to the sketch guarantees.
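
A minimal sketch of why this happens, using a plain Misra-Gries summary as a simplified stand-in. This is not the DataSketches algorithm and not Spark's code; the names `misra_gries`, `frequent_items`, and `max_items` are hypothetical and chosen only for illustration. It shows how a bounded counter table can end up holding fewer than k entries, or none at all:

```python
def misra_gries(stream, max_items):
    """Misra-Gries summary: tracks at most `max_items` counters.

    Illustrative stand-in only; real sketch libraries use a more
    refined scheme with explicit error bounds.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < max_items:
            counters[item] = 1
        else:
            # Table is full: decrement every counter and evict zeros.
            # The incoming item is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def frequent_items(stream, k, max_items):
    """Return up to k (item, estimated_count) pairs, highest count first."""
    counters = misra_gries(stream, max_items)
    ranked = sorted(counters.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Near-uniform stream: no item stands out, so every counter is evicted.
print(frequent_items(list("abcdef"), k=3, max_items=2))   # []

# Skewed stream: only the one heavy item survives, although k = 3.
print(frequent_items(list("aaaaabc"), k=3, max_items=2))  # [('a', 4)]
```

Both results are valid for a frequent-items summary: the estimated count for `a` is a lower bound on its true count of 5, and nothing in the summary promises that k entries exist.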
   
   Using terminology aligned with the DataSketches documentation (frequent 
items / heavy hitters) would make the behavior much clearer to users and reduce 
incorrect expectations around strict top-k semantics.
   
   I’d like to work on this issue by:
   
   * introducing clearer canonical naming
   * preserving backward compatibility through aliases/deprecation paths
   * improving user/developer documentation around sketch guarantees and query 
modes
   * adding regression tests for edge cases like empty results


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


