These operations get slower as the number of groups increases, but with a
fast function (e.g. the standard ones, which are cythonized), the constant
on the increase is pretty low.

In [23]: c = np.random.randint(0,10000,size=100000)

In [24]: df = DataFrame({'v' : v, 'c' : c})

In [25]: %timeit df.groupby('c').count()
100 loops, best of 3: 3.18 ms per loop

In [26]: len(df.groupby('c').count())
Out[26]: 10000

In [27]: df.groupby('c').count()
Out[27]:
       v
c
0      9
1     11
2      7
3      8
4     16
...   ..
9995  11
9996  13
9997  13
9998   7
9999  10

[10000 rows x 1 columns]
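As a pure-NumPy point of comparison, per-group counts for integer labels
like these can be had from np.bincount; a minimal sketch (the variable
names just mirror the session above), making no claims about relative
speed:

import numpy as np

c = np.random.randint(0, 10000, size=100000)

# bincount tallies how many times each label 0..9999 occurs, in one pass
counts = np.bincount(c, minlength=10000)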
On Sat, Feb 13, 2016 at 1:39 PM, Jeff Reback <jeffreb...@gmail.com> wrote:

> In [10]: pd.options.display.max_rows=10
>
> In [13]: np.random.seed(1234)
>
> In [14]: c = np.random.randint(0,32,size=100000)
>
> In [15]: v = np.arange(100000)
>
> In [16]: df = DataFrame({'v' : v, 'c' : c})
>
> In [17]: df
> Out[17]:
>         c      v
> 0      15      0
> 1      19      1
> 2       6      2
> 3      21      3
> 4      12      4
> ...    ..    ...
> 99995   7  99995
> 99996   2  99996
> 99997  27  99997
> 99998  28  99998
> 99999   7  99999
>
> [100000 rows x 2 columns]
>
> In [19]: df.groupby('c').count()
> Out[19]:
>        v
> c
> 0   3136
> 1   3229
> 2   3093
> 3   3121
> 4   3041
> ..   ...
> 27  3128
> 28  3063
> 29  3147
> 30  3073
> 31  3090
>
> [32 rows x 1 columns]
>
> In [20]: %timeit df.groupby('c').count()
> 100 loops, best of 3: 2 ms per loop
>
> In [21]: %timeit df.groupby('c').mean()
> 100 loops, best of 3: 2.39 ms per loop
>
> In [22]: df.groupby('c').mean()
> Out[22]:
>                v
> c
> 0   49883.384885
> 1   50233.692165
> 2   48634.116069
> 3   50811.743992
> 4   50505.368629
> ..           ...
> 27  49715.349425
> 28  50363.501469
> 29  50485.395933
> 30  50190.155223
> 31  50691.041748
>
> [32 rows x 1 columns]
>
>
> On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:
>
>> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane <allanhald...@gmail.com>
>> wrote:
>>
>>> Sorry to reply to myself here, but looking at it with fresh eyes, maybe
>>> the performance of the naive version isn't too bad. Here's a comparison
>>> of the naive vs a better implementation:
>>>
>>> def split_classes_naive(c, v):
>>>     return [v[c == u] for u in unique(c)]
>>>
>>> def split_classes(c, v):
>>>     perm = c.argsort()
>>>     csrt = c[perm]
>>>     div = where(csrt[1:] != csrt[:-1])[0] + 1
>>>     return [v[x] for x in split(perm, div)]
>>>
>>> c = randint(0,32,size=100000)
>>> v = arange(100000)
>>>
>>> %timeit split_classes_naive(c,v)
>>> 100 loops, best of 3: 8.4 ms per loop
>>>
>>> %timeit split_classes(c,v)
>>> 100 loops, best of 3: 4.79 ms per loop
>>
>> The use cases I recently started to target for similar things are 1
>> million or more rows and 10000 uniques in the labels. The second version
>> should be faster for a large number of uniques, I guess.
>>
>> Overall numpy is falling far behind pandas in terms of simple groupby
>> operations. bincount and histogram (IIRC) worked for some cases but are
>> rather limited.
>>
>> reduce_at looks nice for cases where it applies.
>>
>> In contrast to the full-sized labels in the original post, I only know
>> of applications where the labels are 1-D, corresponding to rows or
>> columns.
>>
>> Josef
>
>>> In any case, maybe it is useful to Sergio or others.
>>>
>>> Allan
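As an aside, for grouped sums specifically, the reduce_at that josef
mentions combines naturally with the sort-based indexing from Allan's
split_classes; a minimal sketch, assuming non-empty 1-D integer labels
(the helper name group_sums is invented here):

import numpy as np

def group_sums(c, v):
    # sort so that each label's values sit in one contiguous slice
    perm = c.argsort()
    csrt = c[perm]
    # start index of each group: position 0 plus every point where the
    # sorted label changes
    starts = np.concatenate(([0], np.where(csrt[1:] != csrt[:-1])[0] + 1))
    # add.reduceat sums each slice starts[i]:starts[i+1] in a single pass
    return csrt[starts], np.add.reduceat(v[perm], starts)

With the toy data from Sérgio's original post below (labels [0, 1, 2]
tiled over three rows, data = arange(9)), this returns
(array([0, 1, 2]), array([ 9, 12, 15])).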
>>> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>>>
>>>> I've had a pretty similar idea for a new indexing function
>>>> 'split_classes' which would help in your case, and which essentially
>>>> does
>>>>
>>>> def split_classes(c, v):
>>>>     return [v[c == u] for u in unique(c)]
>>>>
>>>> Your example could be coded as
>>>>
>>>> >>> [sum(c) for c in split_classes(label, data)]
>>>> [9, 12, 15]
>>>>
>>>> I feel I've come across the need for such a function often enough that
>>>> it might be generally useful to people as part of numpy. The
>>>> implementation of split_classes above has pretty poor performance
>>>> because it creates many temporary boolean arrays, so my plan for a PR
>>>> was to have a speedy version of it that uses a single pass through v.
>>>> (I often wanted to use this function on large datasets.)
>>>>
>>>> If anyone has any comments on the idea (good idea? bad idea?) I'd love
>>>> to hear them.
>>>>
>>>> I have some further notes and examples here:
>>>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>>>
>>>> Allan
>>>>
>>>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> This is my first e-mail; I will try to make the idea simple.
>>>>>
>>>>> Similar to a masked array, it would be interesting to use a label
>>>>> array to guide operations.
>>>>>
>>>>> Ex.:
>>>>>
>>>>> >>> x
>>>>> labelled_array(data =
>>>>> [[0 1 2]
>>>>>  [3 4 5]
>>>>>  [6 7 8]],
>>>>> label =
>>>>> [[0 1 2]
>>>>>  [0 1 2]
>>>>>  [0 1 2]])
>>>>>
>>>>> >>> sum(x)
>>>>> array([9, 12, 15])
>>>>>
>>>>> The operations would create a new axis for label indexing.
>>>>>
>>>>> You could think of it as a collection of masks, one for each label.
>>>>>
>>>>> I don't know a way to do something like this efficiently without a
>>>>> loop. Just wondering...
>>>>>
>>>>> Sérgio.
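For what it's worth, the loop-free reduction Sérgio asks about at the end
is already possible for sums, by flattening both arrays and feeding the
labels to bincount with weights; a minimal sketch, not a proposed
labelled_array API:

import numpy as np

data = np.arange(9).reshape(3, 3)
label = np.tile(np.arange(3), (3, 1))

# with weights, bincount accumulates data values per label instead of counts
sums = np.bincount(label.ravel(), weights=data.ravel())
# sums == array([ 9., 12., 15.]) -- float64, since weighted bincount
# returns floats

This covers sum (and, with a second bincount for the counts, mean), but
not general ufuncs, which is where a dedicated function like
split_classes would help.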