In [10]: pd.options.display.max_rows=10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0,32,size=100000)

In [15]: v = np.arange(100000)

In [16]: df = pd.DataFrame({'v' : v, 'c' : c})

In [17]: df
Out[17]:
        c      v
0      15      0
1      19      1
2       6      2
3      21      3
4      12      4
...    ..    ...
99995   7  99995
99996   2  99996
99997  27  99997
99998  28  99998
99999   7  99999

[100000 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
       v
c
0   3136
1   3229
2   3093
3   3121
4   3041
..   ...
27  3128
28  3063
29  3147
30  3073
31  3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
               v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..           ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]
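For reference, the same count/mean can be sketched in plain numpy with bincount (a rough equivalent, not part of the pandas session above; it assumes integer labels starting at 0):

```python
import numpy as np

np.random.seed(1234)
c = np.random.randint(0, 32, size=100000)
v = np.arange(100000)

# per-group counts: bincount of the labels themselves
counts = np.bincount(c)

# per-group means: weighted bincount (group sums) divided by the counts
sums = np.bincount(c, weights=v)
means = sums / counts
```

Since bincount makes a single pass over c, this tends to be competitive with the timings above when the label range is small.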


On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:

>
>
> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane <allanhald...@gmail.com>
> wrote:
>
>> Sorry to reply to myself here, but looking at it with fresh eyes, maybe
>> the performance of the naive version isn't too bad. Here's a comparison of
>> the naive vs a better implementation:
>>
>> def split_classes_naive(c, v):
>>     return [v[c == u] for u in unique(c)]
>>
>> def split_classes(c, v):
>>     perm = c.argsort()
>>     csrt = c[perm]
>>     div = where(csrt[1:] != csrt[:-1])[0] + 1
>>     return [v[x] for x in split(perm, div)]
>>
>> >>> c = randint(0,32,size=100000)
>> >>> v = arange(100000)
>> >>> %timeit split_classes_naive(c,v)
>> 100 loops, best of 3: 8.4 ms per loop
>> >>> %timeit split_classes(c,v)
>> 100 loops, best of 3: 4.79 ms per loop
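A self-contained restatement of the two implementations with explicit imports, plus a sanity check (on a smaller array) that they produce the same groups; within-group order can differ, since argsort is not stable by default:

```python
import numpy as np

def split_classes_naive(c, v):
    # one boolean mask (and one pass over c) per unique label
    return [v[c == u] for u in np.unique(c)]

def split_classes(c, v):
    # sort once, then cut the permutation at the label boundaries
    perm = c.argsort()
    csrt = c[perm]
    div = np.where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in np.split(perm, div)]

np.random.seed(0)
c = np.random.randint(0, 32, size=1000)
v = np.arange(1000)

naive = split_classes_naive(c, v)
fast = split_classes(c, v)
assert all(np.array_equal(np.sort(a), np.sort(b))
           for a, b in zip(naive, fast))
```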
>>
>
> The use cases I have recently started to target for similar things are 1
> million or more rows and 10,000 uniques in the labels.
> The second version should be faster for a large number of uniques, I guess.
>
> Overall numpy is falling far behind pandas in terms of simple groupby
> operations. bincount and histogram (IIRC) worked for some cases but are
> rather limited.
>
> reduceat looks nice for cases where it applies.
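For instance, a grouped sum via np.add.reduceat needs only a sort plus the run boundaries (a minimal sketch, reusing the [9, 12, 15] example from later in the thread):

```python
import numpy as np

c = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])   # labels
v = np.arange(9)                            # data

perm = c.argsort(kind='stable')
csrt, vsrt = c[perm], v[perm]
# start index of each run of equal labels in the sorted order
starts = np.concatenate(([0], np.where(csrt[1:] != csrt[:-1])[0] + 1))
group_sums = np.add.reduceat(vsrt, starts)
# group_sums is [9, 12, 15]
```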
>
> In contrast to the full-sized labels in the original post, I only know of
> applications where the labels are 1-D corresponding to rows or columns.
>
> Josef
>
>
>
>>
>> In any case, maybe it is useful to Sergio or others.
>>
>> Allan
>>
>>
>> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>>
>>> I've had a pretty similar idea for a new indexing function
>>> 'split_classes', which would help in your case; it essentially does
>>>
>>>      def split_classes(c, v):
>>>          return [v[c == u] for u in unique(c)]
>>>
>>> Your example could be coded as
>>>
>>>      >>> [sum(c) for c in split_classes(label, data)]
>>>      [9, 12, 15]
>>>
>>> I feel I've come across the need for such a function often enough that
>>> it might be generally useful to people as part of numpy. The
>>> implementation of split_classes above has pretty poor performance
>>> because it creates many temporary boolean arrays, so my plan for a PR
>>> was to have a speedy version of it that uses a single pass through v.
>>> (I often wanted to use this function on large datasets).
>>>
>>> If anyone has any comments on the idea (good idea? bad idea?), I'd love
>>> to hear.
>>>
>>> I have some further notes and examples here:
>>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>>
>>> Allan
>>>
>>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>>
>>>> Hello,
>>>>
>>>> This is my first e-mail; I will try to keep the idea simple.
>>>>
>>>> Similar to a masked array, it would be interesting to use a label array to
>>>> guide operations.
>>>>
>>>> Ex.:
>>>>  >>> x
>>>> labelled_array(data =
>>>>   [[0 1 2]
>>>>    [3 4 5]
>>>>    [6 7 8]],
>>>>   label =
>>>>   [[0 1 2]
>>>>    [0 1 2]
>>>>    [0 1 2]])
>>>>
>>>>  >>> sum(x)
>>>> array([9, 12, 15])
>>>>
>>>> The operations would create a new axis for label indexing.
>>>>
>>>> You could think of it as a collection of masks, one for each label.
>>>>
>>>> I don't know of a way to do something like this efficiently without a
>>>> loop. Just wondering...
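One loop-free way to get the example result, sketched with bincount on the raveled labels (an illustration, not the proposed labelled_array API):

```python
import numpy as np

data = np.arange(9).reshape(3, 3)       # [[0 1 2] [3 4 5] [6 7 8]]
label = np.tile(np.arange(3), (3, 1))   # [[0 1 2] [0 1 2] [0 1 2]]

# sum all data entries sharing a label, with no explicit Python loop
sums = np.bincount(label.ravel(), weights=data.ravel()).astype(int)
# sums is [9, 12, 15]
```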
>>>>
>>>> Sérgio.
>>>>
>>>>
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion@scipy.org
>>>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>
>>>>
>>>
>
>
>