jmalkin opened a new issue, #56331:
URL: https://github.com/apache/spark/issues/56331

   The `approx_top_k` functions are based on the Apache DataSketches 
ItemsSketch, part of the Frequent Items family. But if you check the reference 
documentation for the sketch type you notice that the term `top k` does not 
appear: 
[https://datasketches.apache.org/docs/Frequency/FrequencySketches.html](https://datasketches.apache.org/docs/Frequency/FrequencySketches.html).
   
   The sketch design tracks frequent items or heavy hitters, allowing up to a 
maximum number -- but may also return fewer. The guarantee is that, based on 
the configured sketch size, items above some total percentage of the total 
weight of the stream will be returned. The sketch is performing correctly, 
under both the `NO_FALSE_POSITIVES` and `NO_FALSE_NEGATIVES` query modes, as 
long as that is true.
   
   It's quite simple to construct a sketch that returns no items despite having 
had multiple submitted. Using pure python as that's easier for demonstration, 
but it's all the same underlying sketch algorithm:
   ```python
   from datasketches import frequent_strings_sketch, frequent_items_error_type
   
   sk = frequent_strings_sketch(3)
   for i in range(1, 8):
       sk.update(f"{i}", 1)
   print(sk.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES))
   ```
   output:
   ```
   []
   ```
   (`NO_FALSE_NEGATIVES` is the more permissive query mode, so it will always 
return at least as many items as `NO_FALSE_POSITIVES`)
   
   Telling users that this is an "approximate top k" is quite misleading, which 
is why the DataSketches team very explicitly did not use such terminology. 
Renaming the functions and including documentation about how they work will 
avoid user complaints that their top k queries don't work, which may well not 
be rare depending the details of the streams, and is important for helping 
users understand when the sketch may be applicable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to