On Sun, Jul 1, 2018 at 7:12 PM, David Mertz <me...@gnosis.cx> wrote:

> Michael changed from set to list at my urging. A list is more general. A
> groupby in Pandas or SQL does not enforce uniqueness, but DOES preserve
> order.
>
<snip>

> It really is better to construct the collection using lists -- in the
> fully general manner -- and then only throw away the generality when
> that is appropriate.

Well, yes -- if there were only one option, then list is pretty obvious.

But whether converting to sets after the fact is just as good or not -- I
don't think so. It's only just as good if you think of it as a one-time
operation: process a bunch of data all at once, and get back a dict with
the results.

But I'm thinking of it in a different way:

Create a custom class derived from dict that you can add stuff to at any
time -- much more like the current examples in the collections module.

If you simply want a groupby function that returns a regular dict, then you
need a utility function (or a few), not a new class.

If you are making a class that requires the values to be a collection of
items, then list is the obvious default, but if someone wants a set -- they
want it built in to the class, not converted after the fact.

I've extended my prototype to do just that:

    class Grouping(dict):
        ...
        def __init__(self, iterable=(), *, collection=list):

"collection" is a class that's either a MutableSequence (has .append and
.extend methods) or a Set (has .add and .update methods).

Once you create a Grouping instance, the collection class you pass in is
used everywhere.

I've put the prototype up on GitHub if anyone wants to take a look, try it
out, suggest changes, etc.:

https://github.com/PythonCHB/grouper

(and enclosed here)

Note that I am NOT proposing this particular implementation or names, or
anything. I welcome feedback on the implementation, API and naming scheme,
but it would be great if we could all be clear on whether the critique is
of the idea or of the implementation.
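
To make the "built in to the class" point concrete, here is a minimal
sketch of the API idea. MiniGrouping and its _adder attribute are
hypothetical names used only for illustration; this is not the enclosed
prototype, just a stripped-down stand-in under the assumption that the
collection class offers either .append or .add.

```python
class MiniGrouping(dict):
    """Hypothetical, stripped-down stand-in for the Grouping prototype."""

    def __init__(self, pairs=(), *, collection=list):
        super().__init__()
        self.collection = collection
        # Decide once which method adds a single item to a group.
        self._adder = "append" if hasattr(collection, "append") else "add"
        for key, value in pairs:
            self[key] = value

    def __setitem__(self, key, value):
        # Assignment adds to the group instead of replacing it.
        # dict.setdefault does not call this override, so no recursion.
        group = self.setdefault(key, self.collection())
        getattr(group, self._adder)(value)


pairs = [(c.casefold(), c) for c in "AbBaA"]

as_lists = MiniGrouping(pairs)
as_sets = MiniGrouping(pairs, collection=set)

# Items added later still honor the collection type chosen up front,
# which a one-time "convert to sets afterwards" step would not give you.
as_lists["a"] = "A"
as_sets["a"] = "A"
```

With lists, the repeated 'A' is kept each time; with sets, it collapses to
a single entry no matter when it is added. The sketch only illustrates the
API idea; the enclosed prototype handles set-like classes differently.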
This particular implementation uses pretty hacky meta-class magic (or the
type constructor anyway) -- if something set-like is passed in, it creates
a subclass that adds .append and .extend methods, so that the rest of the
code doesn't have to special-case. Not sure if that's a good idea; it feels
pretty kludgy -- but kinda fun to write.

It also needs more test cases and example use cases for sure.

And before we go much farther with this discussion, it would be great to
see some more real-world use cases, if anyone has some in mind.

-CHB

-------
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
from collections.abc import Mapping
import heapq

# Extra methods to tack on to a set to make it "act" like a list.
extra_methods = {"append": lambda self, value: self.add(value),
                 "extend": lambda self, sequence: self.update(set(sequence))}


class Grouping(dict):
    """
    Dict subclass for grouping elements of a sequence.

    The values in the dict are a list of all items that have
    corresponded to a given key.

    Essentially, adding a new item is the same as:

    dict.setdefault(key, []).append(value)

    In other words, for each item added:

    grouper['key'] = 'value'

    If the key is already there, the value is added to the corresponding
    list.

    If the key is not already there, a new entry is added, with the key
    as the key, and the value is a new list with the single entry of the
    value in it.

    The __init__ (and update) can take either a mapping of keys to lists,
    or an iterable of (key, value) tuples.

    If the initial data is not in exactly the form desired, a generator
    expression can be used to "transform" the input.

    For example:

    >>> Grouping(((c.casefold(), c) for c in 'AbBa'))
    Grouping({'a': ['A', 'a'], 'b': ['b', 'B']})
    """

    def __init__(self, iterable=(), *, collection=list):
        """
        Create a new Grouping object.

        :param iterable: an iterable or mapping with initial data.
        :param collection: the class used for each group; either
                           sequence-like (.append/.extend) or set-like
                           (.add/.update).
        """
        if hasattr(collection, "append") and hasattr(collection, "extend"):
            # Sequence-like: use the class that was passed in as-is.
            self.collection = collection
        elif hasattr(collection, "add") and hasattr(collection, "update"):
            # this is very kludgy -- adding append and extend methods to a
            # set or set-like object
            self.collection = type("appendset", (collection,), extra_methods)
        else:
            raise TypeError("collection has to be a MutableSequence or set-like object")
        super().__init__()
        self.update(iterable)

    # Override a few dict methods

    def __setitem__(self, key, value):
        # dict.setdefault does not go through __setitem__, so this
        # does not recurse.
        self.setdefault(key, self.collection()).append(value)

    def __repr__(self):
        return f"Grouping({super().__repr__()})"

    @classmethod
    def fromkeys(cls, iterable, v=()):
        # A classmethod has no instance, so build a mapping and let
        # update() create a fresh group per key, extended with v.
        return cls(dict.fromkeys(iterable, tuple(v)))

    def update(self, iterable=(), key=None):
        '''Extend groups with elements from an iterable or with
        key-group items from a dictionary or another Grouping instance.

        If ``key`` is given, it is called on each element to compute the
        group key; otherwise the iterable must yield (key, value) pairs.
        The ``key`` function is ignored for dictionaries and Groupings.

        >>> g = Grouping()
        >>> g.update('AbBa', key=str.casefold)
        >>> g.update(['apple', 'banana'], key=lambda s: s[0])
        >>> g['a']
        ['A', 'a', 'apple']
        '''
        if isinstance(iterable, Mapping):
            for k, g in iterable.items():
                self.setdefault(k, self.collection()).extend(g)
        elif key is not None:
            for item in iterable:
                self[key(item)] = item
        else:
            for k, v in iterable:
                self[k] = v

    def map(self, func):
        """
        Apply a function to each element in every group.
        """
        return {k: [func(v) for v in g] for k, g in self.items()}

    def aggregate(self, func):
        """
        Apply a function to each group.

        >>> from collections import Counter
        >>> g = Grouping(((c.casefold(), c) for c in 'AbBaAa'))
        >>> g.aggregate(''.join)   # concatenate
        {'a': 'AaAa', 'b': 'bB'}
        >>> g.aggregate(set)       # uniques
        {'a': {'A', 'a'}, 'b': {'B', 'b'}}
        >>> g.aggregate(Counter)   # counts
        {'a': Counter({'A': 2, 'a': 2}), 'b': Counter({'B': 1, 'b': 1})}

        Grouping.aggregate behaves similarly to the "map-reduce"
        pattern of programming.

        Grouping(iterable).aggregate(reducer)
        """
        return {k: func(g) for k, g in self.items()}

    def most_common(self, n=None):
        '''List the ``n`` largest groups from largest to smallest.

        If ``n`` is ``None``, then list all groups.
        '''
        keyfunc = lambda item: len(item[1])
        if n is None:
            return sorted(self.items(), key=keyfunc, reverse=True)
        return heapq.nlargest(n, self.items(), key=keyfunc)


"""
test code for grouper class

(run with pytest or maybe nose)
"""

from grouper import Grouping

# example data from the mailing list.
student_school_list = [('Fred', 'SchoolA'),
                       ('Bob', 'SchoolB'),
                       ('Mary', 'SchoolA'),
                       ('Jane', 'SchoolB'),
                       ('Nancy', 'SchoolC'),
                       ]

student_school_dict = {'SchoolA': ['Fred', 'Mary'],
                       'SchoolB': ['Bob', 'Jane'],
                       'SchoolC': ['Nancy']
                       }


def test_init_empty():
    gr = Grouping()
    assert len(gr) == 0


def test_add_one_item():
    gr = Grouping()
    gr['key'] = 'value'

    assert len(gr) == 1
    assert gr['key'] == ['value']


def test_example_loop():
    gr = Grouping()
    for student, school in student_school_list:
        gr[school] = student

    assert len(gr) == 3
    assert gr['SchoolA'] == ['Fred', 'Mary']
    assert gr['SchoolB'] == ['Bob', 'Jane']
    assert gr['SchoolC'] == ['Nancy']


def test_constructor_list():
    """
    Trying to be as similar to the dict constructor as possible:

    We can pass an iterable of (key, value) tuples; the keys will be
    what is grouped by, and the values will be in the groups.
    """
    gr = Grouping(((item[1], item[0]) for item in student_school_list))

    assert len(gr) == 3
    assert gr['SchoolA'] == ['Fred', 'Mary']
    assert gr['SchoolB'] == ['Bob', 'Jane']
    assert gr['SchoolC'] == ['Nancy']


def test_constructor_dict():
    """
    Trying to be as similar to the dict constructor as possible:

    You can construct with a Mapping that already has groups
    """
    gr = Grouping(student_school_dict)

    assert len(gr) == 3
    assert gr['SchoolA'] == ['Fred', 'Mary']
    assert gr['SchoolB'] == ['Bob', 'Jane']
    assert gr['SchoolC'] == ['Nancy']


def test_simple_sequence_example():
    """
    This was an example / use case in Michael Selik's PEP
    """
    gr = Grouping(((c.casefold(), c) for c in 'AbBa'))
    assert gr == {'a': ['A', 'a'], 'b': ['b', 'B']}


def test_most_common():
    gr = Grouping(((c.casefold(), c) for c in 'AbBaAAbCccDe'))

    common = gr.most_common()
    assert len(common) == len(gr)

    common = gr.most_common(2)
    print(common)
    assert len(common) == 2
    assert common == [('a', ['A', 'a', 'A', 'A']), ('b', ['b', 'B', 'b'])]


## You could also specify a custom "collection" type such as a set:

def test_set_single():
    gr = Grouping(collection=set)
    gr['key'] = 5
    gr['key'] = 6
    gr['key'] = 5

    assert gr['key'] == {5, 6}


def test_set_all_at_once():
    gr = Grouping(((c.casefold(), c) for c in 'AbBaAAbCccDe'),
                  collection=set)
    print(gr)

    assert len(gr) == 5
    assert gr['a'] == {'a', 'A'}
    assert gr['b'] == {'b', 'B'}
    assert gr['c'] == {'c', 'C'}
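
The set-handling trick described in the message (using the type constructor
to build a subclass that adds .append and .extend to a set-like class) can
be looked at in isolation. The following is only a sketch of that
technique; with_list_methods and AppendableSet are hypothetical names, not
part of the prototype.

```python
def with_list_methods(set_type):
    """Return a subclass of set_type that also answers to append/extend.

    Hypothetical helper, for illustration only.
    """
    return type(
        "Appendable" + set_type.__name__.capitalize(),  # new class name
        (set_type,),                                    # base classes
        {
            # list-style spellings that delegate to the set methods
            "append": lambda self, value: self.add(value),
            "extend": lambda self, values: self.update(values),
        },
    )


AppendableSet = with_list_methods(set)

group = AppendableSet()
group.append(5)
group.extend([6, 5, 7])
```

The resulting object is still a set (duplicates collapse, and it compares
equal to an ordinary set with the same members), but code written against
.append and .extend can use it without special-casing.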
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/