These operations get slower as the number of groups increases, but with a
fast function (e.g. the standard ones, which are cythonized), the constant
on that increase is pretty low.

In [23]: c = np.random.randint(0,10000,size=100000)

In [24]: df = DataFrame({'v' : v, 'c' : c})

In [25]: %timeit df.groupby('c').count()
100 loops, best of 3: 3.18 ms per loop

In [26]: len(df.groupby('c').count())
Out[26]: 10000

In [27]: df.groupby('c').count()
Out[27]:
       v
c
0      9
1     11
2      7
3      8
4     16
...   ..
9995  11
9996  13
9997  13
9998   7
9999  10

[10000 rows x 1 columns]
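
For comparison, a pure-numpy count over the same labels (a sketch, not part
of the original session; np.unique gained return_counts in numpy 1.9):

    import numpy as np

    # group labels and their sizes, in sorted label order
    u, counts = np.unique(c, return_counts=True)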


On Sat, Feb 13, 2016 at 1:39 PM, Jeff Reback <jeffreb...@gmail.com> wrote:

> In [10]: pd.options.display.max_rows=10
>
> In [13]: np.random.seed(1234)
>
> In [14]: c = np.random.randint(0,32,size=100000)
>
> In [15]: v = np.arange(100000)
>
> In [16]: df = DataFrame({'v' : v, 'c' : c})
>
> In [17]: df
> Out[17]:
>         c      v
> 0      15      0
> 1      19      1
> 2       6      2
> 3      21      3
> 4      12      4
> ...    ..    ...
> 99995   7  99995
> 99996   2  99996
> 99997  27  99997
> 99998  28  99998
> 99999   7  99999
>
> [100000 rows x 2 columns]
>
> In [19]: df.groupby('c').count()
> Out[19]:
>        v
> c
> 0   3136
> 1   3229
> 2   3093
> 3   3121
> 4   3041
> ..   ...
> 27  3128
> 28  3063
> 29  3147
> 30  3073
> 31  3090
>
> [32 rows x 1 columns]
>
> In [20]: %timeit df.groupby('c').count()
> 100 loops, best of 3: 2 ms per loop
>
> In [21]: %timeit df.groupby('c').mean()
> 100 loops, best of 3: 2.39 ms per loop
>
> In [22]: df.groupby('c').mean()
> Out[22]:
>                v
> c
> 0   49883.384885
> 1   50233.692165
> 2   48634.116069
> 3   50811.743992
> 4   50505.368629
> ..           ...
> 27  49715.349425
> 28  50363.501469
> 29  50485.395933
> 30  50190.155223
> 31  50691.041748
>
> [32 rows x 1 columns]
>
>
> On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:
>
>>
>>
>> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane <allanhald...@gmail.com>
>> wrote:
>>
>>> Sorry to reply to myself here, but looking at it with fresh eyes, maybe
>>> the performance of the naive version isn't too bad. Here's a comparison
>>> of the naive version against a better implementation:
>>>
>>> import numpy as np
>>>
>>> def split_classes_naive(c, v):
>>>     # one boolean mask per unique label: O(n * n_groups)
>>>     return [v[c == u] for u in np.unique(c)]
>>>
>>> def split_classes(c, v):
>>>     # sort the labels once, then split v at the group boundaries
>>>     perm = c.argsort()
>>>     csrt = c[perm]
>>>     div = np.where(csrt[1:] != csrt[:-1])[0] + 1
>>>     return [v[x] for x in np.split(perm, div)]
>>>
>>> >>> c = np.random.randint(0, 32, size=100000)
>>> >>> v = np.arange(100000)
>>> >>> %timeit split_classes_naive(c,v)
>>> 100 loops, best of 3: 8.4 ms per loop
>>> >>> %timeit split_classes(c,v)
>>> 100 loops, best of 3: 4.79 ms per loop
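>>>
>>> As a quick sanity check, the two versions return the same groups. The
>>> within-group ordering can differ, since argsort is not stable by
>>> default, so each group is sorted before comparing:
>>>
>>> >>> a = split_classes_naive(c, v)
>>> >>> b = split_classes(c, v)
>>> >>> all(np.array_equal(np.sort(x), np.sort(y)) for x, y in zip(a, b))
>>> True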
>>>
>>
>> The use cases I recently started to target for similar things involve 1
>> million or more rows and 10,000 uniques in the labels.
>> The second version should be faster for a large number of uniques, I guess.
>>
>> Overall, numpy is falling far behind pandas in terms of simple groupby
>> operations. np.bincount and np.histogram (IIRC) work for some cases but
>> are rather limited.
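>>
>> For instance, Sérgio's example below reduces to a single np.bincount
>> call, since bincount accepts weights (a sketch, assuming integer labels
>> starting at 0):
>>
>> >>> data = np.arange(9).reshape(3, 3)
>> >>> label = np.tile([0, 1, 2], (3, 1))
>> >>> np.bincount(label.ravel(), weights=data.ravel())
>> array([  9.,  12.,  15.])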
>>
>> reduceat looks nice for the cases where it applies.
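>>
>> A sketch of grouped sums with np.add.reduceat, reusing the c and v from
>> Allan's example (the labels have to be sorted first, and reduceat takes
>> the offsets where each group starts):
>>
>> >>> perm = c.argsort()
>> >>> csrt = c[perm]
>> >>> starts = np.concatenate(([0], np.where(csrt[1:] != csrt[:-1])[0] + 1))
>> >>> sums = np.add.reduceat(v[perm], starts)  # one sum per group, in label order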
>>
>> In contrast to the full-sized labels in the original post, I only know of
>> applications where the labels are 1-D, corresponding to rows or columns.
>>
>> Josef
>>
>>
>>
>>>
>>> In any case, maybe it is useful to Sergio or others.
>>>
>>> Allan
>>>
>>>
>>> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>>>
>>>> I've had a pretty similar idea for a new indexing function,
>>>> 'split_classes', which would help in your case. It essentially does
>>>>
>>>>      def split_classes(c, v):
>>>>          return [v[c == u] for u in np.unique(c)]
>>>>
>>>> Your example could be coded as
>>>>
>>>>      >>> [sum(g) for g in split_classes(label, data)]
>>>>      [9, 12, 15]
>>>>
>>>> I feel I've come across the need for such a function often enough that
>>>> it might be generally useful to people as part of numpy. The
>>>> implementation of split_classes above has pretty poor performance
>>>> because it creates many temporary boolean arrays, so my plan for a PR
>>>> was to have a speedy version of it that uses a single pass through v.
>>>> (I often wanted to use this function on large datasets).
>>>>
>>>> If anyone has any comments on the idea (good idea? bad idea?), I'd love
>>>> to hear them.
>>>>
>>>> I have some further notes and examples here:
>>>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>>>
>>>> Allan
>>>>
>>>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> This is my first e-mail; I will try to keep the idea simple.
>>>>>
>>>>> Similar to a masked array, it would be interesting to use a label array
>>>>> to guide operations.
>>>>>
>>>>> Ex.:
>>>>>  >>> x
>>>>> labelled_array(data =
>>>>>   [[0 1 2]
>>>>>    [3 4 5]
>>>>>    [6 7 8]],
>>>>>               label =
>>>>>   [[0 1 2]
>>>>>    [0 1 2]
>>>>>    [0 1 2]])
>>>>>
>>>>>  >>> sum(x)
>>>>> array([9, 12, 15])
>>>>>
>>>>> The operations would create a new axis for label indexing.
>>>>>
>>>>> You could think of it as a collection of masks, one for each label.
>>>>>
>>>>> I don't know a way to make something like this efficiently without a
>>>>> loop. Just wondering...
>>>>>
>>>>> Sérgio.
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>>
>