In [10]: pd.options.display.max_rows = 10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0, 32, size=100000)

In [15]: v = np.arange(100000)

In [16]: df = DataFrame({'v': v, 'c': c})

In [17]: df
Out[17]:
        c      v
0      15      0
1      19      1
2       6      2
3      21      3
4      12      4
...    ..    ...
99995   7  99995
99996   2  99996
99997  27  99997
99998  28  99998
99999   7  99999

[100000 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
       v
c
0   3136
1   3229
2   3093
3   3121
4   3041
..   ...
27  3128
28  3063
29  3147
30  3073
31  3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
               v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..           ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]

On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:
>
> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane <allanhald...@gmail.com>
> wrote:
>
>> Sorry to reply to myself here, but looking at it with fresh eyes, maybe
>> the performance of the naive version isn't too bad. Here's a comparison
>> of the naive vs. a better implementation:
>>
>> def split_classes_naive(c, v):
>>     return [v[c == u] for u in unique(c)]
>>
>> def split_classes(c, v):
>>     perm = c.argsort()
>>     csrt = c[perm]
>>     div = where(csrt[1:] != csrt[:-1])[0] + 1
>>     return [v[x] for x in split(perm, div)]
>>
>> >>> c = randint(0, 32, size=100000)
>> >>> v = arange(100000)
>> >>> %timeit split_classes_naive(c, v)
>> 100 loops, best of 3: 8.4 ms per loop
>> >>> %timeit split_classes(c, v)
>> 100 loops, best of 3: 4.79 ms per loop
>
> The use cases I recently started to target for similar things are
> 1 million or more rows and 10,000 uniques in the labels. The second
> version should be faster for a large number of uniques, I guess.
>
> Overall, numpy is falling far behind pandas in terms of simple groupby
> operations.
> bincount and histogram (IIRC) worked for some cases but are rather
> limited.
>
> reduceat looks nice for cases where it applies.
>
> In contrast to the full-sized labels in the original post, I only know
> of applications where the labels are 1-D, corresponding to rows or
> columns.
>
> Josef
>
>> In any case, maybe it is useful to Sergio or others.
>>
>> Allan
>>
>> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>>
>>> I've had a pretty similar idea for a new indexing function
>>> 'split_classes' which would help in your case, which essentially does
>>>
>>> def split_classes(c, v):
>>>     return [v[c == u] for u in unique(c)]
>>>
>>> Your example could be coded as
>>>
>>> >>> [sum(c) for c in split_classes(label, data)]
>>> [9, 12, 15]
>>>
>>> I feel I've come across the need for such a function often enough that
>>> it might be generally useful to people as part of numpy. The
>>> implementation of split_classes above has pretty poor performance
>>> because it creates many temporary boolean arrays, so my plan for a PR
>>> was to have a speedy version of it that uses a single pass through v.
>>> (I often wanted to use this function on large datasets.)
>>>
>>> If anyone has any comments on the idea (good idea? bad idea?), I'd
>>> love to hear them.
>>>
>>> I have some further notes and examples here:
>>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>>
>>> Allan
>>>
>>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>>
>>>> Hello,
>>>>
>>>> This is my first e-mail; I will try to keep the idea simple.
>>>>
>>>> Similar to a masked array, it would be interesting to use a label
>>>> array to guide operations.
>>>>
>>>> Ex.:
>>>>
>>>> >>> x
>>>> labelled_array(data =
>>>>  [[0 1 2]
>>>>   [3 4 5]
>>>>   [6 7 8]],
>>>> label =
>>>>  [[0 1 2]
>>>>   [0 1 2]
>>>>   [0 1 2]])
>>>>
>>>> >>> sum(x)
>>>> array([9, 12, 15])
>>>>
>>>> The operations would create a new axis for label indexing.
>>>>
>>>> You could think of it as a collection of masks, one for each label.
>>>>
>>>> I don't know a way to make something like this efficiently without
>>>> a loop. Just wondering...
>>>>
>>>> Sérgio.
>>>>
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion@scipy.org
>>>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
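As a note on Josef's point that bincount covers some of these cases: for a grouped sum like Sérgio's example, no Python loop is needed at all. A minimal sketch, assuming the labels are non-negative integers starting near 0 (the values and expected result are taken from the example above):

```python
import numpy as np

# Sérgio's example: a label array guiding a reduction over the data array.
data = np.array([[0, 1, 2],
                 [3, 4, 5],
                 [6, 7, 8]])
label = np.array([[0, 1, 2],
                  [0, 1, 2],
                  [0, 1, 2]])

# np.bincount with weights accumulates `data` within each label class
# in a single pass; note it returns floats when weights are given.
sums = np.bincount(label.ravel(), weights=data.ravel())
print(sums)  # [ 9. 12. 15.]
```

This only covers sums (and, via a plain bincount of the labels, counts and means); more general per-label reductions still need something like split_classes.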
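Following up on the reduceat remark: the sort-and-split trick in Allan's split_classes extends naturally to a loop-free grouped reduction. A rough sketch (`group_sums` is a hypothetical helper, not an existing numpy function, and it assumes `c` is non-empty):

```python
import numpy as np

def group_sums(c, v):
    # Sort the values by class, then sum each contiguous run of equal
    # labels with a single np.add.reduceat call -- the same boundaries
    # split_classes computes, minus the per-class Python loop.
    perm = np.argsort(c, kind='stable')
    csrt = c[perm]
    starts = np.concatenate(([0], np.where(csrt[1:] != csrt[:-1])[0] + 1))
    return csrt[starts], np.add.reduceat(v[perm], starts)

# The labelled_array example from the original post, flattened to 1-D:
c = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
v = np.arange(9)
labels, sums = group_sums(c, v)
print(labels)  # [0 1 2]
print(sums)    # [ 9 12 15]
```

Swapping `np.add` for another ufunc (`np.maximum`, `np.minimum`, ...) gives the corresponding grouped reduction with the same index computation.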