Impressive!

Possibly there's still a case for including a 'groupby' function in numpy itself since it's a generally useful operation, but I do see less of a need given the nice pandas functionality.

At least, the next time someone asks a StackOverflow question like the ones below, someone should tell them to use pandas!

(Copied from my gist, for future reference on the list.)

http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-values-in-the-array-a-condition/31484134#31484134
http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-on-values-in-the-array
http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smaller-arrays-in-python
http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-to-a-condition-in-numpy
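For reference, the pandas answer to most of those questions is a groupby followed by an aggregation. A minimal sketch, with made-up toy data standing in for the questions' arrays:

```python
import numpy as np
import pandas as pd

# Toy version of the recurring question: sum the values in `v`
# grouped by the labels in `c`.
c = np.array([0, 1, 0, 2, 1, 0])
v = np.array([1, 2, 3, 4, 5, 6])

df = pd.DataFrame({'c': c, 'v': v})
sums = df.groupby('c')['v'].sum()
# `sums` is a Series indexed by group label; here sums[0] == 10,
# sums[1] == 7, sums[2] == 4.
```

Replacing .sum() with .mean(), .count(), etc. covers the other variants.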

Allan


On 02/13/2016 01:39 PM, Jeff Reback wrote:
In [10]: pd.options.display.max_rows=10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0,32,size=100000)

In [15]: v = np.arange(100000)

In [16]: df = DataFrame({'v' : v, 'c' : c})

In [17]: df
Out[17]:
         c      v
0      15      0
1      19      1
2       6      2
3      21      3
4      12      4
...    ..    ...
99995   7  99995
99996   2  99996
99997  27  99997
99998  28  99998
99999   7  99999

[100000 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
        v
c
0   3136
1   3229
2   3093
3   3121
4   3041
..   ...
27  3128
28  3063
29  3147
30  3073
31  3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
                v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..           ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]


On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:



    On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane
    <allanhald...@gmail.com> wrote:

        Sorry to reply to myself here, but looking at it with fresh
        eyes, maybe the performance of the naive version isn't too bad.
        Here's a comparison of the naive vs. a better implementation:

        # assumes the numpy namespace is in scope, e.g. `from numpy import *`
        def split_classes_naive(c, v):
            return [v[c == u] for u in unique(c)]

        def split_classes(c, v):
            perm = c.argsort()
            csrt = c[perm]
            div = where(csrt[1:] != csrt[:-1])[0] + 1
            return [v[x] for x in split(perm, div)]

        >>> c = randint(0,32,size=100000)
        >>> v = arange(100000)
        >>> %timeit split_classes_naive(c,v)
        100 loops, best of 3: 8.4 ms per loop
        >>> %timeit split_classes(c,v)
        100 loops, best of 3: 4.79 ms per loop


    The use cases I recently started to target for similar things are 1
    million or more rows and 10,000 unique labels.
    The second version should be faster for a large number of uniques, I
    guess.

    Overall, numpy is falling far behind pandas in terms of simple
    groupby operations. bincount and histogram (IIRC) work for some
    cases but are rather limited.
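    For example, a grouped sum fits bincount via its weights argument,
    as long as the labels are small non-negative integers (a sketch
    with made-up toy data):

```python
import numpy as np

# Grouped sum via bincount: labels must be non-negative integers,
# and the result has one slot per label value up to max(c).
c = np.array([0, 1, 0, 2, 1, 0])               # group labels
v = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # values

sums = np.bincount(c, weights=v)
print(sums)  # [10.  7.  4.]
```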

    reduceat (np.add.reduceat and friends) looks nice for the cases
    where it applies.
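    A sketch of grouped sums with np.add.reduceat, which needs the
    values sorted by label first (mirroring the sorting trick in
    split_classes above; toy data made up):

```python
import numpy as np

c = np.array([0, 1, 0, 2, 1, 0])   # group labels
v = np.array([1, 2, 3, 4, 5, 6])   # values

# Sort values by label, find the start index of each run of equal
# labels, then reduce over each run in one vectorized call.
perm = c.argsort(kind='stable')
csrt = c[perm]
starts = np.concatenate(([0], np.where(csrt[1:] != csrt[:-1])[0] + 1))
sums = np.add.reduceat(v[perm], starts)
print(sums)  # [10  7  4]
```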

    In contrast to the full-sized labels in the original post, I only
    know of applications where the labels are 1-D, corresponding to rows
    or columns.

    Josef


        In any case, maybe it is useful to Sergio or others.

        Allan


        On 02/13/2016 12:11 PM, Allan Haldane wrote:

            I've had a pretty similar idea for a new indexing function
            'split_classes' which would help in your case, which
            essentially does

                  def split_classes(c, v):
                      return [v[c == u] for u in unique(c)]

            Your example could be coded as

                  >>> [sum(c) for c in split_classes(label, data)]
                  [9, 12, 15]

            I feel I've come across the need for such a function often
            enough that
            it might be generally useful to people as part of numpy. The
            implementation of split_classes above has pretty poor
            performance
            because it creates many temporary boolean arrays, so my plan
            for a PR
            was to have a speedy version of it that uses a single pass
            through v.
            (I often wanted to use this function on large datasets).

            If anyone has any comments on the idea (good idea? bad
            idea?) I'd love to hear them.

            I have some further notes and examples here:
            https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

            Allan

            On 02/12/2016 09:40 AM, Sérgio wrote:

                Hello,

                This is my first e-mail, I will try to make the idea simple.

                Similar to a masked array, it would be interesting to
                use a label array to guide operations.

                Ex.:
                  >>> x
                labelled_array(data =
                  [[0 1 2]
                   [3 4 5]
                   [6 7 8]],
                               label =
                  [[0 1 2]
                   [0 1 2]
                   [0 1 2]])

                  >>> sum(x)
                array([9, 12, 15])

                The operations would create a new axis for label indexing.

                You could think of it as a collection of masks, one for
                each label.

                I don't know a way to make something like this
                efficiently without a
                loop. Just wondering...
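                One loop-free way to get the sums in the example above
                is np.bincount on the raveled arrays (a sketch, assuming
                non-negative integer labels):

```python
import numpy as np

data = np.arange(9).reshape(3, 3)       # [[0 1 2], [3 4 5], [6 7 8]]
label = np.tile(np.arange(3), (3, 1))   # same label down each column

# Sum all data entries sharing a label, with no Python-level loop.
sums = np.bincount(label.ravel(), weights=data.ravel())
print(sums)  # [ 9. 12. 15.]
```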

                Sérgio.


                _______________________________________________
                NumPy-Discussion mailing list
                NumPy-Discussion@scipy.org
                https://mail.scipy.org/mailman/listinfo/numpy-discussion








