On Sat, Feb 13, 2016 at 1:42 PM, Jeff Reback <jeffreb...@gmail.com> wrote:

> These operations get slower as the number of groups increases, but with a
> fast function (e.g. the standard ones, which are cythonized), the constant
> factor on that increase is pretty low.
>
> In [23]: c = np.random.randint(0,10000,size=100000)
>
> In [24]: df = DataFrame({'v' : v, 'c' : c})
>
> In [25]: %timeit df.groupby('c').count()
> 100 loops, best of 3: 3.18 ms per loop
>
> In [26]: len(df.groupby('c').count())
> Out[26]: 10000
>
> In [27]: df.groupby('c').count()
> Out[27]:
>        v
> c
> 0      9
> 1     11
> 2      7
> 3      8
> 4     16
> ...   ..
> 9995  11
> 9996  13
> 9997  13
> 9998   7
> 9999  10
>
> [10000 rows x 1 columns]
>
>
One other difference across use cases is whether this is a single operation,
or whether we want to optimize the data format for a large number of
different calculations.  (We have both cases in statsmodels.)

In the latter case it's worth spending some extra computational effort up
front to rearrange the data, either sorting it or splitting it into lists of
arrays (I'm guessing here, without having done any timings).
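
For example, reusing the c and v from the quoted sessions below, a rough
(untimed) sketch of the sort-once, reuse-many pattern might look like:

    import numpy as np

    # pay the argsort once ...
    perm = np.argsort(c, kind='mergesort')   # stable sort
    csrt = c[perm]
    div = np.flatnonzero(csrt[1:] != csrt[:-1]) + 1
    groups = np.split(v[perm], div)          # list of arrays, one per label

    # ... then every later calculation runs over pre-grouped data
    counts = [len(g) for g in groups]
    sums = [g.sum() for g in groups]
    means = [g.mean() for g in groups]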

Josef

>
> On Sat, Feb 13, 2016 at 1:39 PM, Jeff Reback <jeffreb...@gmail.com> wrote:
>
>> In [10]: pd.options.display.max_rows=10
>>
>> In [13]: np.random.seed(1234)
>>
>> In [14]: c = np.random.randint(0,32,size=100000)
>>
>> In [15]: v = np.arange(100000)
>>
>> In [16]: df = DataFrame({'v' : v, 'c' : c})
>>
>> In [17]: df
>> Out[17]:
>>         c      v
>> 0      15      0
>> 1      19      1
>> 2       6      2
>> 3      21      3
>> 4      12      4
>> ...    ..    ...
>> 99995   7  99995
>> 99996   2  99996
>> 99997  27  99997
>> 99998  28  99998
>> 99999   7  99999
>>
>> [100000 rows x 2 columns]
>>
>> In [19]: df.groupby('c').count()
>> Out[19]:
>>        v
>> c
>> 0   3136
>> 1   3229
>> 2   3093
>> 3   3121
>> 4   3041
>> ..   ...
>> 27  3128
>> 28  3063
>> 29  3147
>> 30  3073
>> 31  3090
>>
>> [32 rows x 1 columns]
>>
>> In [20]: %timeit df.groupby('c').count()
>> 100 loops, best of 3: 2 ms per loop
>>
>> In [21]: %timeit df.groupby('c').mean()
>> 100 loops, best of 3: 2.39 ms per loop
>>
>> In [22]: df.groupby('c').mean()
>> Out[22]:
>>                v
>> c
>> 0   49883.384885
>> 1   50233.692165
>> 2   48634.116069
>> 3   50811.743992
>> 4   50505.368629
>> ..           ...
>> 27  49715.349425
>> 28  50363.501469
>> 29  50485.395933
>> 30  50190.155223
>> 31  50691.041748
>>
>> [32 rows x 1 columns]
>>
>>
>> On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:
>>
>>>
>>>
>>> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane <allanhald...@gmail.com>
>>> wrote:
>>>
>>>> Sorry to reply to myself here, but looking at it with fresh eyes, maybe
>>>> the performance of the naive version isn't too bad. Here's a comparison
>>>> of the naive version vs. a better implementation:
>>>>
>>>> from numpy import arange, split, unique, where
>>>> from numpy.random import randint
>>>>
>>>> def split_classes_naive(c, v):
>>>>     # one boolean mask (and one temporary array) per unique label
>>>>     return [v[c == u] for u in unique(c)]
>>>>
>>>> def split_classes(c, v):
>>>>     # sort once, then cut at the boundaries between runs of equal labels
>>>>     perm = c.argsort()
>>>>     csrt = c[perm]
>>>>     div = where(csrt[1:] != csrt[:-1])[0] + 1
>>>>     return [v[x] for x in split(perm, div)]
>>>>
>>>> >>> c = randint(0,32,size=100000)
>>>> >>> v = arange(100000)
>>>> >>> %timeit split_classes_naive(c,v)
>>>> 100 loops, best of 3: 8.4 ms per loop
>>>> >>> %timeit split_classes(c,v)
>>>> 100 loops, best of 3: 4.79 ms per loop
>>>>
>>>
>>> The use cases I recently started to target for similar things are 1
>>> million or more rows with 10000 unique values in the labels.
>>> The second version should be faster for a large number of uniques, I
>>> guess.
>>>
>>> Overall numpy is falling far behind pandas in terms of simple groupby
>>> operations. bincount and histogram (IIRC) work for some cases but are
>>> rather limited.
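>>>
>>> For instance, bincount with weights covers grouped counts, sums, and
>>> means, though only when the labels are non-negative integers (a small
>>> sketch):
>>>
>>>     import numpy as np
>>>
>>>     c = np.random.randint(0, 32, size=100000)   # integer labels
>>>     v = np.arange(100000, dtype=float)
>>>
>>>     counts = np.bincount(c)              # group sizes
>>>     sums = np.bincount(c, weights=v)     # group sums
>>>     means = sums / counts                # group means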
>>>
>>> reduceat (ufunc.reduceat) looks nice for cases where it applies.
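>>>
>>> E.g., a grouped sum via add.reduceat, assuming the data has been sorted
>>> by label first (continuing the sketch above):
>>>
>>>     perm = np.argsort(c, kind='mergesort')
>>>     csrt = c[perm]
>>>     starts = np.r_[0, np.flatnonzero(csrt[1:] != csrt[:-1]) + 1]
>>>     group_sums = np.add.reduceat(v[perm], starts)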
>>>
>>> In contrast to the full-sized labels in the original post, I only know
>>> of applications where the labels are 1-D, corresponding to rows or
>>> columns.
>>>
>>> Josef
>>>
>>>
>>>
>>>>
>>>> In any case, maybe it is useful to Sérgio or others.
>>>>
>>>> Allan
>>>>
>>>>
>>>> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>>>>
>>>>> I've had a pretty similar idea for a new indexing function,
>>>>> 'split_classes', which would help in your case; it essentially does
>>>>>
>>>>>      def split_classes(c, v):
>>>>>          return [v[c == u] for u in unique(c)]
>>>>>
>>>>> Your example could be coded as
>>>>>
>>>>>      >>> [sum(g) for g in split_classes(label, data)]
>>>>>      [9, 12, 15]
>>>>>
>>>>> I feel I've come across the need for such a function often enough that
>>>>> it might be generally useful to people as part of numpy. The
>>>>> implementation of split_classes above has pretty poor performance
>>>>> because it creates many temporary boolean arrays, so my plan for a PR
>>>>> was to have a speedy version of it that uses a single pass through v.
>>>>> (I've often wanted to use this function on large datasets.)
>>>>>
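>>>>> One possible shape for that (just a rough sketch, not necessarily what
>>>>> the PR will do; np.unique itself sorts, so this is only "single pass"
>>>>> over v) is a counting-sort style scatter into preallocated buckets:
>>>>>
>>>>>      import numpy as np
>>>>>
>>>>>      def split_classes_counting(c, v):
>>>>>          uniq, inv = np.unique(c, return_inverse=True)
>>>>>          counts = np.bincount(inv, minlength=len(uniq))
>>>>>          offsets = np.concatenate(([0], np.cumsum(counts)))
>>>>>          order = np.empty(len(inv), dtype=np.intp)
>>>>>          pos = offsets[:-1].copy()
>>>>>          for i, g in enumerate(inv):   # the one pass (C in a real PR)
>>>>>              order[pos[g]] = i
>>>>>              pos[g] += 1
>>>>>          return [v[order[s:e]]
>>>>>                  for s, e in zip(offsets[:-1], offsets[1:])]
>>>>>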
>>>>> If anyone has any comments on the idea (good idea? bad idea?), I'd love
>>>>> to hear them.
>>>>>
>>>>> I have some further notes and examples here:
>>>>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>>>>
>>>>> Allan
>>>>>
>>>>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> This is my first e-mail; I will try to keep the idea simple.
>>>>>>
>>>>>> Similar to masked arrays, it would be interesting to use a label array
>>>>>> to guide operations.
>>>>>>
>>>>>> Ex.:
>>>>>>  >>> x
>>>>>> labelled_array(data =
>>>>>>   [[0 1 2]
>>>>>>    [3 4 5]
>>>>>>    [6 7 8]],
>>>>>>                label =
>>>>>>   [[0 1 2]
>>>>>>    [0 1 2]
>>>>>>    [0 1 2]])
>>>>>>
>>>>>>  >>> sum(x)
>>>>>> array([9, 12, 15])
>>>>>>
>>>>>> The operations would create a new axis for label indexing.
>>>>>>
>>>>>> You could think of it as a collection of masks, one for each label.
>>>>>>
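>>>>>> Spelled out with plain numpy, that collection-of-masks view is a loop
>>>>>> over the unique labels, e.g. for the example above:
>>>>>>
>>>>>>      import numpy as np
>>>>>>
>>>>>>      data = np.arange(9).reshape(3, 3)
>>>>>>      label = np.tile(np.arange(3), (3, 1))
>>>>>>
>>>>>>      # one boolean mask per unique label value
>>>>>>      result = np.array([data[label == u].sum()
>>>>>>                         for u in np.unique(label)])
>>>>>>      # result -> array([ 9, 12, 15])
>>>>>>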
>>>>>> I don't know of a way to do something like this efficiently without a
>>>>>> loop. Just wondering...
>>>>>>
>>>>>> Sérgio.
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion
