On Wed, Mar 15, 2017 at 10:39 AM, Steven D'Aprano <st...@pearwood.info> wrote:
> > But I can imagine an occasional need to, e.g. "find outliers." However,
> > that is not hard to spell as `mycounter.most_common()[-1*N:]`. Or if your
> > program does this often, write a utility function `find_outliers(...)`
>
> That's not how you find outliers :-)
> Just because a data point is uncommon doesn't mean it is an outlier.
>

That's kinda *by definition* what an outlier is in categorical data! E.g.:

In [1]: from glob import glob
In [2]: from collections import Counter
In [3]: names = Counter()
In [4]: for fname in glob('babynames/yob*.txt'):
   ...:     for line in open(fname):
   ...:         name, sex, num = line.strip().split(',')
   ...:         num = int(num)
   ...:         names[name] += num
   ...:
In [5]: names.most_common(3)
Out[5]: [('James', 5086540), ('John', 5073452), ('Robert', 4795444)]
In [6]: rare_names = names.most_common()[-3:]
In [7]: rare_names
Out[7]: [('Zyerre', 5), ('Zylas', 5), ('Zytavion', 5)]
In [8]: sum(names.values())   # nicer would be `names.total`
Out[8]: 326086290

This isn't exactly statistics, but it's like your product example. There
are infinitely many random strings that occurred zero times among US
births. But a "rare name" is one that occurred at least once, not one of
these zero-occurring possible strings.

I realize from my example, however, that I'm probably more interested in
the actual uncommonality, not the specific `.least_common()`. I.e. I'd
like to know which names occurred fewer than 10 times... but I don't know
how many items that will include. Or as a percentage, which names occur in
fewer than 0.01% of births?

> I don't think there's any good reason to want to find the "least common"
> values in a statistics context, but there might be other use-cases for
> it. For example, suppose we are interested in the *least* popular
> products being sold:
>
>     Counter(order.item for order in orders)
>
> We can get the best selling products easily, but not the duds that don't
> sell much at all.
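
For the "duds" case quoted just above, the threshold idea (fewer than N
occurrences, or less than some share of the total) may be a better fit than
a fixed-size `.least_common()`. A minimal sketch, assuming made-up order
data; the item names and the helpers `fewer_than` and `below_share` are
hypothetical and not part of `collections.Counter`:

```python
from collections import Counter

# Hypothetical order data; the item names are invented for illustration.
sold = Counter(["widget"] * 50 + ["gadget"] * 30 + ["gizmo"] * 3 + ["doohickey"])


def fewer_than(counter, n):
    """Return {item: count} for items counted fewer than n times."""
    return {item: count for item, count in counter.items() if count < n}


def below_share(counter, fraction):
    """Return {item: count} for items making up less than `fraction` of the total."""
    total = sum(counter.values())
    return {item: count for item, count in counter.items() if count < fraction * total}


print(fewer_than(sold, 10))     # items sold fewer than 10 times
print(below_share(sold, 0.02))  # items under 2% of all units sold
```

Both helpers return however many items fall under the threshold, so the
caller does not need to know N in advance. They only see items that were
counted at least once.
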
>
> However, the problem is that what we really need to see is the items
> that don't sell at all (count=0), and they won't show up! So I think
> that this is not actually a useful feature.
>
>
> > > 2) Undefined behavior when using Counter.most_common:
> > > 'c', 'c']), when calling c.most_common(3), there are more than 3 "most
> > > common" elements in c and c.most_common(3) will not always return the
> > > same list, since there is no defined total order on the elements in c.
> > >
> > > Should this be mentioned in the documentation?
> > >
> >
> > +1. I'd definitely support adding this point to the documentation.
>
> The docs already say that "Elements with equal counts are ordered
> arbitrarily" so I'm not sure what more is needed.
>
>
> --
> Steve
> _______________________________________________
> Python-ideas mailing list
> Python-ideas@python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>

--
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons. Intellectual property is
to the 21st century what the slave trade was to the 16th.