On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi <dani...@grinta.net> wrote:

> What does k stand for here? As someone that never encountered this
> function before I find both names equally confusing. If I understand
> what the function is supposed to be doing, I think largest() would be
> much more descriptive.
>

`k` is the number of elements to return. `largest()` can connote that it's
only returning the one largest value. It's fairly typical to include a
dummy variable (`k` or `n`) in the name to indicate that the function lets
you specify how many you want. See, for example, the stdlib `heapq`
module's `nlargest()` function.

https://docs.python.org/3/library/heapq.html#heapq.nlargest
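
To illustrate the naming point, here is a minimal sketch of the stdlib function, where the `n` in the name maps onto the first argument. The sample list is made up for illustration.

```python
import heapq

# Made-up sample data, purely for illustration.
values = [3, 1, 4, 1, 5, 9, 2, 6]

# The first argument is the `n` in the function's name: how many of the
# largest values to return. The result is sorted from largest to smallest.
three_largest = heapq.nlargest(3, values)
print(three_largest)  # [9, 6, 5]

# With n=1 you still get a list back, not a bare scalar.
print(heapq.nlargest(1, values))  # [9]
```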

"top-k" comes from the ML community where this function is used to evaluate
classification models (`k` instead of `n` being largely an accident of
history, I imagine). In many classification problems, the number of classes
is very large, and many of them are closely related. For example,
ImageNet has a lot of different dog breeds broken out as separate classes.
In order to get a more balanced view of the relative performance of the
classification models, you often want to check whether the correct class is
in the top 5 classes (or whatever `k` is appropriate) that the model
predicted for the example, not just the one class that the model says is
the most likely. "5 largest" doesn't really work in the sentences that one
usually writes when talking about ML classifiers; those sentences are about
the 5 classes that are associated with the 5 largest values from the
predictor, not the values themselves. So "top k" is what gets used in ML
discussions, and that transfers over to the name of the function in ML
libraries.
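
As a rough sketch of that ML usage, the following computes top-k accuracy with only the stdlib. The helper names `top_k_classes` and `top_k_accuracy` and the toy scores are hypothetical, chosen for illustration; they are not the API of numpy or of any ML library.

```python
import heapq

def top_k_classes(scores, k):
    # Hypothetical helper: indices (class ids) of the k largest scores,
    # ordered from highest score to lowest.
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

def top_k_accuracy(all_scores, labels, k):
    # Hypothetical helper: fraction of examples whose correct class is
    # among the k classes the model scored highest.
    hits = sum(
        label in top_k_classes(scores, k)
        for scores, label in zip(all_scores, labels)
    )
    return hits / len(labels)

# Toy predictions: one row of per-class scores per example.
all_scores = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.3, 0.5],
]
labels = [2, 0, 1]  # correct class for each example

print(top_k_classes(all_scores[0], 2))        # [1, 2]
print(top_k_accuracy(all_scores, labels, 1))  # one of three examples is a hit
print(top_k_accuracy(all_scores, labels, 2))  # 1.0
```

Note that what matters here is the class indices, not the score values, which is why "5 largest" reads awkwardly in this context.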

It is a top-down reflection of the higher level thing that people want to
compute (in that context) rather than a bottom-up description of how the
function is manipulating the input, if that makes sense. Either one is a
valid way to name things. There is a lot to be said for the argument that,
given numpy's domain-agnostic nature, we should prefer the bottom-up
description style of naming. However, we are also in the midst of a diversifying
ecosystem of array libraries, largely driven by the ML domain, and adopting
some of that terminology when we try to enhance our interoperability with
those libraries is also a factor to be considered.

-- 
Robert Kern