Just something I tried with pandas:

>>> image
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]],

       [[40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])
>>> label
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7]])
>>> dt = pd.DataFrame(np.vstack((label.ravel(), image.reshape(3, 20))).T)
>>> labelled_image = dt.groupby(0)
>>> labelled_image.mean().values
array([[ 0, 20, 40],
       [ 3, 23, 43],
       [ 6, 26, 46],
       [ 9, 29, 49],
       [10, 30, 50],
       [13, 33, 53],
       [16, 36, 56],
       [19, 39, 59]])

Sergio

> Date: Sat, 13 Feb 2016 22:41:13 -0500
> From: Allan Haldane <allanhald...@gmail.com>
> To: numpy-discussion@scipy.org
> Subject: Re: [Numpy-discussion] [Suggestion] Labelled Array
> Message-ID: <56bff759.7010...@gmail.com>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
> Impressive!
>
> Possibly there's still a case for including a 'groupby' function in
> numpy itself, since it's a generally useful operation, but I do see less
> of a need given the nice pandas functionality.
>
> At least, next time someone asks a stackoverflow question like the ones
> below, someone should tell them to use pandas!
>
> (copied from my gist for future list reference.)
> http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
> http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-values-in-the-array-a-condition/31484134#31484134
> http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-on-values-in-the-array
> http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smaller-arrays-in-python
> http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-to-a-condition-in-numpy
>
> Allan
>
> On 02/13/2016 01:39 PM, Jeff Reback wrote:
> > In [10]: pd.options.display.max_rows=10
> >
> > In [13]: np.random.seed(1234)
> >
> > In [14]: c = np.random.randint(0,32,size=100000)
> >
> > In [15]: v = np.arange(100000)
> >
> > In [16]: df = DataFrame({'v' : v, 'c' : c})
> >
> > In [17]: df
> > Out[17]:
> >         c      v
> > 0      15      0
> > 1      19      1
> > 2       6      2
> > 3      21      3
> > 4      12      4
> > ...    ..    ...
> > 99995   7  99995
> > 99996   2  99996
> > 99997  27  99997
> > 99998  28  99998
> > 99999   7  99999
> >
> > [100000 rows x 2 columns]
> >
> > In [19]: df.groupby('c').count()
> > Out[19]:
> >        v
> > c
> > 0   3136
> > 1   3229
> > 2   3093
> > 3   3121
> > 4   3041
> > ..   ...
> > 27  3128
> > 28  3063
> > 29  3147
> > 30  3073
> > 31  3090
> >
> > [32 rows x 1 columns]
> >
> > In [20]: %timeit df.groupby('c').count()
> > 100 loops, best of 3: 2 ms per loop
> >
> > In [21]: %timeit df.groupby('c').mean()
> > 100 loops, best of 3: 2.39 ms per loop
> >
> > In [22]: df.groupby('c').mean()
> > Out[22]:
> >                v
> > c
> > 0   49883.384885
> > 1   50233.692165
> > 2   48634.116069
> > 3   50811.743992
> > 4   50505.368629
> > ..           ...
> > 27  49715.349425
> > 28  50363.501469
> > 29  50485.395933
> > 30  50190.155223
> > 31  50691.041748
> >
> > [32 rows x 1 columns]
> >
> > On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:
> >
> >     On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane
> >     <allanhald...@gmail.com> wrote:
> >
> >         Sorry, to reply to myself here, but looking at it with fresh
> >         eyes maybe the performance of the naive version isn't too bad.
> >         Here's a comparison of the naive vs a better implementation:
> >
> >             def split_classes_naive(c, v):
> >                 return [v[c == u] for u in unique(c)]
> >
> >             def split_classes(c, v):
> >                 perm = c.argsort()
> >                 csrt = c[perm]
> >                 div = where(csrt[1:] != csrt[:-1])[0] + 1
> >                 return [v[x] for x in split(perm, div)]
> >
> >             >>> c = randint(0,32,size=100000)
> >             >>> v = arange(100000)
> >             >>> %timeit split_classes_naive(c,v)
> >             100 loops, best of 3: 8.4 ms per loop
> >             >>> %timeit split_classes(c,v)
> >             100 loops, best of 3: 4.79 ms per loop
> >
> >     The use cases I recently started to target for similar things are
> >     1 million or more rows and 10000 uniques in the labels. The second
> >     version should be faster for a large number of uniques, I guess.
> >
> >     Overall numpy is falling far behind pandas in terms of simple
> >     groupby operations. bincount and histogram (IIRC) worked for some
> >     cases but are rather limited.
> >
> >     reduce_at looks nice for cases where it applies.
> >
> >     In contrast to the full sized labels in the original post, I only
> >     know of applications where the labels are 1-D corresponding to rows
> >     or columns.
> >
> >     Josef
> >
> >         In any case, maybe it is useful to Sergio or others.
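For what it's worth, here is a self-contained, runnable version of Allan's two split_classes variants quoted above (plain NumPy; the bare names in the quoted snippet are assumed to come from a `from numpy import *`):

```python
import numpy as np

def split_classes_naive(c, v):
    # One boolean mask per unique label: O(n_labels * n) comparisons.
    return [v[c == u] for u in np.unique(c)]

def split_classes(c, v):
    # Sort once, then cut the permutation at the label boundaries.
    perm = c.argsort()
    csrt = c[perm]
    div = np.where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in np.split(perm, div)]

c = np.random.randint(0, 32, size=100000)
v = np.arange(100000)

# Both versions produce the same groups (order within a group may differ,
# since the default argsort is not stable), in ascending label order.
for a, b in zip(split_classes_naive(c, v), split_classes(c, v)):
    assert np.array_equal(np.sort(a), np.sort(b))
```

The argsort version does the label comparisons once on the sorted array instead of once per unique label, which is where the speedup for many uniques comes from.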
> >         Allan
> >
> >         On 02/13/2016 12:11 PM, Allan Haldane wrote:
> >
> >             I've had a pretty similar idea for a new indexing function
> >             'split_classes' which would help in your case, which
> >             essentially does
> >
> >                 def split_classes(c, v):
> >                     return [v[c == u] for u in unique(c)]
> >
> >             Your example could be coded as
> >
> >                 >>> [sum(c) for c in split_classes(label, data)]
> >                 [9, 12, 15]
> >
> >             I feel I've come across the need for such a function often
> >             enough that it might be generally useful to people as part
> >             of numpy. The implementation of split_classes above has
> >             pretty poor performance because it creates many temporary
> >             boolean arrays, so my plan for a PR was to have a speedy
> >             version of it that uses a single pass through v. (I often
> >             wanted to use this function on large datasets.)
> >
> >             If anyone has any comments on the idea (good idea? bad
> >             idea?) I'd love to hear.
> >
> >             I have some further notes and examples here:
> >             https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
> >
> >             Allan
> >
> >             On 02/12/2016 09:40 AM, Sérgio wrote:
> >
> >                 Hello,
> >
> >                 This is my first e-mail; I will try to keep the idea
> >                 simple.
> >
> >                 Similar to a masked array, it would be interesting to
> >                 use a label array to guide operations.
> >
> >                 Ex.:
> >                 >>> x
> >                 labelled_array(data =
> >                 [[0 1 2]
> >                  [3 4 5]
> >                  [6 7 8]],
> >                 label =
> >                 [[0 1 2]
> >                  [0 1 2]
> >                  [0 1 2]])
> >
> >                 >>> sum(x)
> >                 array([9, 12, 15])
> >
> >                 The operations would create a new axis for label
> >                 indexing.
> >
> >                 You could think of it as a collection of masks, one
> >                 for each label.
> >
> >                 I don't know a way to do something like this
> >                 efficiently without a loop. Just wondering...
> >
> >                 Sérgio.
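Incidentally, the specific sum-by-label reduction in the original example can already be done in plain NumPy without an explicit Python loop, using np.bincount with weights. This is only a sketch of that one reduction, not of the proposed labelled_array type:

```python
import numpy as np

data = np.arange(9).reshape(3, 3)      # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
label = np.tile(np.arange(3), (3, 1))  # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]

# bincount sums the weights that fall into each label bin:
# label 0 -> 0+3+6, label 1 -> 1+4+7, label 2 -> 2+5+8.
sums = np.bincount(label.ravel(), weights=data.ravel())
print(sums)  # [ 9. 12. 15.]
```

Note that bincount returns float64 whenever weights are given, and it only covers sum-like reductions over non-negative integer labels, which is part of why a general groupby keeps coming up.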
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion