jmalkin opened a new issue, #56331: URL: https://github.com/apache/spark/issues/56331
The `approx_top_k` functions are based on the Apache DataSketches ItemsSketch, part of the Frequent Items family. But if you check the reference documentation for the sketch type you notice that the term `top k` does not appear: [https://datasketches.apache.org/docs/Frequency/FrequencySketches.html](https://datasketches.apache.org/docs/Frequency/FrequencySketches.html). The sketch design tracks frequent items or heavy hitters, allowing up to a maximum number -- but may also return fewer. The guarantee is that, based on the configured sketch size, items above some total percentage of the total weight of the stream will be returned. The sketch is performing correctly, under both the `NO_FALSE_POSITIVES` and `NO_FALSE_NEGATIVES` query modes, as long as that is true. It's quite simple to construct a sketch that returns no items despite having had multiple submitted. Using pure python as that's easier for demonstration, but it's all the same underlying sketch algorithm: ```python from datasketches import frequent_strings_sketch, frequent_items_error_type sk = frequent_strings_sketch(3) for i in range(1, 8): sk.update(f"{i}", 1) print(sk.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES)) ``` output: ``` [] ``` (`NO_FALSE_NEGATIVES` is the more permissive query mode, so it will always return at least as many items as `NO_FALSE_POSITIVES`) Telling users that this is an "approximate top k" is quite misleading, which is why the DataSketches team very explicitly did not use such terminology. Renaming the functions and including documentation about how they work will avoid user complaints that their top k queries don't work, which may well not be rare depending the details of the streams, and is important for helping users understand when the sketch may be applicable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
