Re: [Numpy-discussion] EHN: Discusions about 'add numpy.topk'

Ilhan Polat Sat, 29 May 2021 23:38:31 -0700

Since this is going into the top-level namespace, I'd also vote against the
matlab-y "topk" name. Even matlab didn't do what I would have expected; it
went with maxk:


https://nl.mathworks.com/help/matlab/ref/maxk.html

I think "max_k" is a good generalization of the regular "max". Even when
auto-completing, having this show up under "max" makes more sense to me than
searching for it among the "t"s. Besides, "argmax_k" follows suit, in line
with the existing convention. To my eyes this is an acceptable disturbance to
an already very crowded namespace.
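
To make the naming concrete, here is a rough sketch of what such a pair could look like. Note that "max_k" and "argmax_k" are only the names proposed above, not existing numpy functions; the 1-D-only handling and the largest-first ordering are my own assumptions for illustration.

```python
import numpy as np

def argmax_k(a, k):
    """Hypothetical helper: indices of the k largest entries of a 1-D array, largest first."""
    a = np.asarray(a)
    if not 0 < k <= a.size:
        raise ValueError("k must be between 1 and a.size")
    # argpartition moves the k largest entries into the last k slots (in no particular order)
    idx = np.argpartition(a, -k)[-k:]
    # sort only those k candidates, descending by value
    return idx[np.argsort(a[idx])[::-1]]

def max_k(a, k):
    """Hypothetical helper: the k largest values of a 1-D array, largest first."""
    a = np.asarray(a)
    return a[argmax_k(a, k)]

scores = np.array([0.1, 0.7, 0.05, 0.9, 0.3])
print(argmax_k(scores, 2))  # indices of the two largest values
print(max_k(scores, 2))     # the two largest values themselves
```

With k=1 this reduces to what "argmax" and "max" already return (as length-1 arrays), which is the sense in which the names generalize.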



a few moments later....

But then again, an ugly idea rears its head: putting this into the existing
max function. But I'll shut up now :)







On Sun, May 30, 2021 at 12:50 AM Robert Kern <robert.k...@gmail.com> wrote:

> On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi <dani...@grinta.net>
> wrote:
>
>> What does k stand for here? As someone that never encountered this
>> function before I find both names equally confusing. If I understand
>> what the function is supposed to be doing, I think largest() would be
>> much more descriptive.
>>
>
> `k` is the number of elements to return. `largest()` can connote that it's
> only returning the one largest value. It's fairly typical to include a
> dummy variable (`k` or `n`) in the name to indicate that the function lets
> you specify how many you want. See, for example, the stdlib `heapq`
> module's `nlargest()` function.
>
> https://docs.python.org/3/library/heapq.html#heapq.nlargest
>
> "top-k" comes from the ML community where this function is used to
> evaluate classification models (`k` instead of `n` being largely an
> accident of history, I imagine). In many classification problems, the
> number of classes is very large, and many are closely related to each other.
> For example, ImageNet has a lot of different dog breeds broken out as
> separate classes. In order to get a more balanced view of the relative
> performance of the classification models, you often want to check whether
> the correct class is in the top 5 classes (or whatever `k` is appropriate)
> that the model predicted for the example, not just the one class that the
> model says is the most likely. "5 largest" doesn't really work in the
> sentences that one usually writes when talking about ML classifiers; they
> are talking about the 5 classes that are associated with the 5 largest
> values from the predictor, not the values themselves. So "top k" is what
> gets used in ML discussions, and that transfers over to the name of the
> function in ML libraries.
>
> It is a top-down reflection of the higher level thing that people want to
> compute (in that context) rather than a bottom-up description of how the
> function is manipulating the input, if that makes sense. Either one is a
> valid way to name things. Given numpy's domain-agnostic nature, there is
> a lot to be said for preferring the bottom-up descriptive style of naming.
> However, we are also in the midst of a diversifying
> ecosystem of array libraries, largely driven by the ML domain, and adopting
> some of that terminology when we try to enhance our interoperability with
> those libraries is also a factor to be considered.
>
> --
> Robert Kern
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
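
For readers who have not met the ML usage Robert describes, the top-k check can be sketched with the stdlib `heapq.nlargest()` he links to. This is only an illustration: the helper names `top_k_classes` and `top_k_accuracy` and the toy scores are made up here, and they are not part of numpy or of any ML library's API.

```python
import heapq

def top_k_classes(scores, k):
    """Indices of the k highest-scoring classes, best first (hypothetical helper)."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of examples whose true class is among the k best-scored classes."""
    hits = sum(
        label in top_k_classes(row, k)
        for row, label in zip(score_rows, true_labels)
    )
    return hits / len(true_labels)

# Toy predictions: one row of per-class scores per example (made-up numbers).
score_rows = [
    [0.10, 0.60, 0.30],
    [0.50, 0.10, 0.40],
    [0.20, 0.30, 0.50],
]
true_labels = [1, 2, 0]

print(top_k_accuracy(score_rows, true_labels, k=1))
print(top_k_accuracy(score_rows, true_labels, k=2))
```

The point of the example is that the quantity of interest is the set of class indices, not the score values, which is why "top k" reads more naturally than "k largest" in that context.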
