Re: [Python-ideas] Fwd: grouping / dict of lists

Chris Barker via Python-ideas Sun, 01 Jul 2018 21:59:33 -0700

On Sun, Jul 1, 2018 at 7:12 PM, David Mertz <[email protected]> wrote:

> Michael changed from set to list at my urging. A list is more general. A
> groupby in Pandas or SQL does not enforce uniqueness, but DOES preserve
> order.
>


<snip>

> It really is better to construct the collection using lists, in the fully
> general manner, and then only throw away the generality when that's
> appropriate.

well, yes -- if there were only one option, then list is pretty obvious.

but whether converting to sets after the fact is just as good or not -- I
don't think so.

It's only just as good if you think of it as a one-time operation --
process a bunch of data all at once, and get back a dict with the results.
But I'm thinking of it in a different way:

Create a custom class derived from dict that you can add stuff to at any
time -- much more like the current examples in the collections module.

If you simply want a groupby function that returns a regular dict, then you
need a utility function (or a few), not a new class.
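
For the one-shot case, something along these lines would probably do. This is
only a sketch: the name group_pairs is made up for illustration and is not
part of the prototype.

```python
def group_pairs(pairs):
    """Group an iterable of (key, value) pairs into a plain dict of lists."""
    groups = {}
    for key, value in pairs:
        # setdefault creates the list the first time a key is seen
        groups.setdefault(key, []).append(value)
    return groups


students = [("Fred", "SchoolA"), ("Bob", "SchoolB"), ("Mary", "SchoolA")]
# swap each pair so the school becomes the grouping key
by_school = group_pairs((school, name) for name, school in students)
print(by_school)
```

The result is a regular dict, so nothing stops later code from assigning a
non-list value to a key.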

If you are making a class that enforces the values to be a collection
of items, then list is the obvious default, but if someone wants a set --
they want it built in to the class, not converted after the fact.
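
To illustrate the difference, here is a rough sketch that uses
collections.defaultdict as a stand-in (it is not the prototype): with the set
built in, groups stay sets as more data arrives, while a conversion after the
fact is only a snapshot.

```python
from collections import defaultdict

data = [("a", 1), ("a", 1), ("b", 2)]

# set built in: every add goes through set.add, so uniqueness always holds
groups = defaultdict(set)
for key, value in data:
    groups[key].add(value)
groups["a"].add(1)          # adding later: still unique

# converted after the fact: build lists first, then convert once
as_lists = defaultdict(list)
for key, value in data:
    as_lists[key].append(value)
snapshot = {k: set(v) for k, v in as_lists.items()}
as_lists["a"].append(3)     # later data does not show up in the snapshot

print(dict(groups))
print(snapshot)
```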

I've extended my prototype to do just that:

class Grouping(dict):
    ...
    def __init__(self, iterable=(), *, collection=list):

"collection" is a class that's either a Mutable Sequence (has .append and
.extend methods) or Set (has .add and .update methods).

Once you create a Grouping instance, the collection class you pass in is
used everywhere.
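
One alternative to checking for method names would be to test the class
against the ABCs in collections.abc. A rough sketch, with the caveat that the
helper name kind_of_collection is hypothetical and this is not how the
prototype does it:

```python
from collections import deque
from collections.abc import MutableSequence, MutableSet


def kind_of_collection(collection):
    """Classify a collection class as 'sequence' or 'set'."""
    if not isinstance(collection, type):
        raise TypeError("collection has to be a class")
    if issubclass(collection, MutableSequence):
        return "sequence"       # has .append and .extend
    if issubclass(collection, MutableSet):
        return "set"            # has .add; the ABC does not promise .update
    raise TypeError("collection has to be a MutableSequence or MutableSet")


print(kind_of_collection(list))
print(kind_of_collection(deque))
print(kind_of_collection(set))
```

Immutable types such as tuple and frozenset are rejected this way, but so is
any duck-typed class that never registered with the ABCs.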

I've put the prototype up on GitHub if anyone wants to take a look, try it
out, suggest changes, etc:

https://github.com/PythonCHB/grouper

(and enclosed here)

Note that I am NOT proposing this particular implementation or names, or
anything. I welcome feedback on the implementation, API and naming scheme,
but it would be great if we could all be clear on whether the critique is
of the idea or of the implementation.

This particular implementation uses pretty hacky meta-class magic (or the
type constructor anyway) -- if something set-like is passed in, it creates
a subclass that adds .append and .extend methods, so that the rest of the
code doesn't have to special-case. Not sure if that's a good idea, it feels
pretty kludgy -- but kinda fun to write.
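
Stripped down, the trick looks roughly like this (a standalone sketch; the
names list_like and AppendSet are just for illustration):

```python
# methods that let a set answer to the list-style names
list_like = {
    "append": lambda self, value: self.add(value),
    "extend": lambda self, values: self.update(values),
}

# three-argument type() builds a new class: name, bases, namespace
AppendSet = type("AppendSet", (set,), list_like)

s = AppendSet()
s.append(5)
s.append(5)         # duplicate, silently dropped by set.add
s.extend([6, 7])
print(s == {5, 6, 7}, isinstance(s, set))
```

The instances are still real sets, so equality with a plain set and all the
usual set operations keep working.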

It also needs more test cases and example use cases for sure.

And before we go much further with this discussion, it would be great to
see some more real-world use cases, if anyone has some in mind.

-CHB

-------
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[email protected]

from collections import Counter  # only needed by the aggregate() doctest
from collections.abc import Mapping
import heapq



# extra methods to tack on to set to make it "act" like a list.
extra_methods = {"append": lambda self, value: self.add(value),
                 "extend": lambda self, sequence: self.update(set(sequence))}


class Grouping(dict):
    """
    Dict subclass for grouping elements of a sequence.

    The values in the dict are a list of all items that have
    corresponded to a given key

    essentially, adding a new item is the same as:

    dict.setdefault(key, []).append(value)

    In other words, for each item added:

    grouper['key'] = 'value'

    If the key is already there, the value is added to the corresponding list.

    If the key is not already there, a new entry is added, with the key as the
    key, and the value is a new list with the single entry of the value in it.

    The __init__ (and update) can take either a mapping of keys to lists,
    or an iterable of (key, value) tuples.

    If the initial data is not in exactly the form desired, a generator
    expression can be used to "transform" the input.

    For example:

        >>> Grouping(((c.casefold(), c) for c in 'AbBa'))
        Grouping({'a': ['A', 'a'], 'b': ['b', 'B']})
    """

    def __init__(self, iterable=(), *, collection=list):
        """
        Create a new Grouping object.

        :param iterable: an iterable or mapping with initial data.

        :param collection: the class used to hold each group: list-like
                           (has .append and .extend) or set-like
                           (has .add and .update).
        """
        if hasattr(collection, "append") and hasattr(collection, "extend"):
            # use the class that was passed in, not always list
            self.collection = collection
        elif hasattr(collection, "add") and hasattr(collection, "update"):
            # this is very kludgy -- adding append and extend methods to a
            # set or set-like object
            self.collection = type("appendset", (collection,), extra_methods)
        else:
            raise TypeError("collection has to be a MutableSequence or set-like object")
        super().__init__()
        self.update(iterable)

    # Override a few dict methods

    def __setitem__(self, key, value):
        self.setdefault(key, self.collection()).append(value)

    def __repr__(self):
        return f"Grouping({super().__repr__()})"

    @classmethod
    def fromkeys(cls, iterable, v=()):
        # a classmethod has no ``self``; update() copies ``v`` into a
        # fresh collection for every key
        return cls(dict.fromkeys(iterable, v))

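    def update(self, iterable=(), key=None):
        '''Extend groups with elements from an iterable or with
        key-group items from a dictionary or another Grouping instance.

        Without ``key``, the iterable must yield (key, value) pairs;
        with ``key``, each element is grouped under ``key(element)``.
        The ``key`` function is ignored for dictionaries and Groupings.

            >>> g = Grouping((c.casefold(), c) for c in 'AbBa')
            >>> g.update(['apple', 'banana'], key=lambda s: s[0])
            >>> g['a']
            ['A', 'a', 'apple']

        '''
        if isinstance(iterable, Mapping):
            for k, g in iterable.items():
                self.setdefault(k, self.collection()).extend(g)
        else:
            if key is not None:
                # pair each element with the key it should be grouped under
                iterable = ((key(item), item) for item in iterable)
            for k, g in iterable:
                self[k] = g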

    def map(self, func):
        """
        Apply a function to each element in every group.
        """
        return {k: [func(v) for v in g] for k, g in self.items()}

    def aggregate(self, func):
        """
        Apply a function to each group.

            >>> g = Grouping(((c.casefold(), c) for c in 'AbBaAa'))
            >>> g.aggregate(''.join)    # concatenate
            {'a': 'AaAa', 'b': 'bB'}
            >>> g.aggregate(set)        # uniques
            {'a': {'A', 'a'}, 'b': {'B', 'b'}}
            >>> g.aggregate(Counter)    # counts
            {'a': Counter({'A': 2, 'a': 2}), 'b': Counter({'B': 1, 'b': 1})}

        Grouping.aggregate behaves similarly to the "map-reduce"
        pattern of programming.

            Grouping(iterable).aggregate(reducer)

        """
        return {k: func(g) for k, g in self.items()}

    def most_common(self, n=None):
        '''List the ``n`` largest groups from largest to smallest.  If
        ``n`` is ``None``, then list all groups.
        '''
        keyfunc = lambda item: len(item[1])
        if n is None:
            return sorted(self.items(), key=keyfunc, reverse=True)
        return heapq.nlargest(n, self.items(), key=keyfunc)

"""
test code for grouper class

(run with pytest or maybe nose)

"""

from grouper import Grouping

# example data from the mailing list.
student_school_list = [('Fred', 'SchoolA'),
                       ('Bob', 'SchoolB'),
                       ('Mary', 'SchoolA'),
                       ('Jane', 'SchoolB'),
                       ('Nancy', 'SchoolC'),
                       ]

student_school_dict = {'SchoolA': ['Fred', 'Mary'],
                       'SchoolB': ['Bob', 'Jane'],
                       'SchoolC': ['Nancy']
                       }


def test_init_empty():
    gr = Grouping()
    assert len(gr) == 0


def test_add_one_item():
    gr = Grouping()

    gr['key'] = 'value'

    assert len(gr) == 1
    assert gr['key'] == ['value']


def test_example_loop():
    gr = Grouping()

    for student, school in student_school_list:
        gr[school] = student

    assert len(gr) == 3

    assert gr['SchoolA'] == ['Fred', 'Mary']
    assert gr['SchoolB'] == ['Bob', 'Jane']
    assert gr['SchoolC'] == ['Nancy']


def test_constructor_list():
    """
    Trying to be as similar to the dict constructor as possible:

    We can pass an iterable of (key, value) tuples, the keys
    will be what is grouped by, and the values will be in the groups.
    """
    gr = Grouping(((item[1], item[0]) for item in student_school_list))

    assert len(gr) == 3

    assert gr['SchoolA'] == ['Fred', 'Mary']
    assert gr['SchoolB'] == ['Bob', 'Jane']
    assert gr['SchoolC'] == ['Nancy']


def test_constructor_dict():
    """
    Trying to be as similar to the dict constructor as possible:

    You can construct with a Mapping that already has groups
    """
    gr = Grouping(student_school_dict)

    assert len(gr) == 3

    assert gr['SchoolA'] == ['Fred', 'Mary']
    assert gr['SchoolB'] == ['Bob', 'Jane']
    assert gr['SchoolC'] == ['Nancy']


def test_simple_sequence_example():
    """
    This was an example / use case in Michael Selik's PEP
    """
    gr = Grouping(((c.casefold(), c) for c in 'AbBa'))

    assert gr == {'a': ['A', 'a'],
                  'b': ['b', 'B']}


def test_most_common():
    gr = Grouping(((c.casefold(), c) for c in 'AbBaAAbCccDe'))
    common = gr.most_common()
    assert len(common) == len(gr)

    common = gr.most_common(2)

    print(common)
    assert len(common) == 2
    assert common == [('a', ['A', 'a', 'A', 'A']), ('b', ['b', 'B', 'b'])]


## You could also specify a custom "collection" type such as a set:
def test_set_single():
    gr = Grouping(collection=set)
    gr['key'] = 5
    gr['key'] = 6
    gr['key'] = 5

    assert gr['key'] == set((5,6))


def test_set_all_at_once():
    gr = Grouping(((c.casefold(), c) for c in 'AbBaAAbCccDe'),
                  collection=set)
    print(gr)

    assert len(gr) == 5
    assert gr['a'] == set(('a','A'))
    assert gr['b'] == set(('b','B'))
    assert gr['c'] == set(('c','C'))

_______________________________________________
Python-ideas mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
