durgaprasadml commented on issue #56331: URL: https://github.com/apache/spark/issues/56331#issuecomment-4625403478
Thanks for raising this. The concern makes sense.

The current approx_top_k naming strongly implies approximate ranking semantics, while the underlying implementation is actually based on the Apache DataSketches Frequent Items / Heavy Hitters sketch family, which provides threshold-based guarantees rather than true top-k guarantees.

As demonstrated in the example above, the sketch can legitimately return:

* fewer than k items
* or even zero items

depending on the stream distribution and configured sketch size, while still behaving correctly according to the sketch guarantees.

Using terminology aligned with the DataSketches documentation (frequent items / heavy hitters) would make the behavior much clearer to users and reduce incorrect expectations around strict top-k semantics.

I’d like to work on this issue by:

* introducing clearer canonical naming
* preserving backward compatibility through aliases/deprecation paths
* improving user/developer documentation around sketch guarantees and query modes
* adding regression tests for edge cases like empty results
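To make the "fewer than k, or even zero" behavior concrete, here is a minimal sketch of a threshold-based frequent-items summary. It is a simplified Misra-Gries-style counter written for illustration only, not the DataSketches algorithm and not Spark's approx_top_k implementation; the names `FrequentItemsSketch`, `max_error`, and `top_k` are hypothetical. It reports only items whose counted lower bound exceeds the accumulated error, which is roughly the idea behind a "no false positives" query mode.

```python
class FrequentItemsSketch:
    """Simplified Misra-Gries-style summary (illustrative stand-in only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity          # maximum number of tracked items
        self.counts: dict[str, int] = {}  # lower bounds on true frequencies
        self.max_error = 0                # how much any count may be undercounted

    def update(self, item: str) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Table is full: decrement every counter, drop those reaching zero,
            # and record that all estimates may now be off by one more.
            self.max_error += 1
            self.counts = {k: c - 1 for k, c in self.counts.items() if c > 1}

    def frequent_items(self, k: int) -> list[tuple[str, int]]:
        # Keep only items whose lower bound clears the error threshold,
        # then return at most k of them, highest count first.
        confident = [(i, c) for i, c in self.counts.items() if c > self.max_error]
        confident.sort(key=lambda pair: pair[1], reverse=True)
        return confident[:k]


def top_k(stream: list[str], capacity: int, k: int) -> list[tuple[str, int]]:
    sketch = FrequentItemsSketch(capacity)
    for item in stream:
        sketch.update(item)
    return sketch.frequent_items(k)


# Uniform stream: 100 distinct items, each seen once.
uniform = [f"u{i}" for i in range(100)]
# Skewed stream: one heavy item followed by 40 distinct light items.
skewed = ["a"] * 60 + [f"x{i}" for i in range(40)]

print(top_k(uniform, capacity=4, k=3))
print(top_k(skewed, capacity=4, k=3))
```

With a small capacity, the uniform stream yields no items at all and the skewed stream yields a single item even though k is 3. Both results are consistent with the threshold guarantee, and they are the kind of edge case the proposed regression tests would cover.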
