On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane <allanhald...@gmail.com> wrote:
> Sorry, to reply to myself here, but looking at it with fresh eyes maybe
> the performance of the naive version isn't too bad. Here's a comparison of
> the naive vs a better implementation:
>
> def split_classes_naive(c, v):
>     return [v[c == u] for u in unique(c)]
>
> def split_classes(c, v):
>     perm = c.argsort()
>     csrt = c[perm]
>     div = where(csrt[1:] != csrt[:-1])[0] + 1
>     return [v[x] for x in split(perm, div)]
>
> >>> c = randint(0,32,size=100000)
> >>> v = arange(100000)
> >>> %timeit split_classes_naive(c,v)
> 100 loops, best of 3: 8.4 ms per loop
> >>> %timeit split_classes(c,v)
> 100 loops, best of 3: 4.79 ms per loop

The use cases I recently started to target for similar things are 1 million
or more rows and 10,000 uniques in the labels.

The second version should be faster for a large number of uniques, I guess.

Overall, numpy is falling far behind pandas in terms of simple groupby
operations. bincount and histogram (IIRC) worked for some cases but are
rather limited. reduceat looks nice for cases where it applies.

In contrast to the full-sized labels in the original post, I only know of
applications where the labels are 1-D, corresponding to rows or columns.

Josef

> In any case, maybe it is useful to Sergio or others.
>
> Allan
>
> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>
>> I've had a pretty similar idea for a new indexing function
>> 'split_classes' which would help in your case, which essentially does
>>
>> def split_classes(c, v):
>>     return [v[c == u] for u in unique(c)]
>>
>> Your example could be coded as
>>
>> >>> [sum(c) for c in split_classes(label, data)]
>> [9, 12, 15]
>>
>> I feel I've come across the need for such a function often enough that
>> it might be generally useful to people as part of numpy. The
>> implementation of split_classes above has pretty poor performance
>> because it creates many temporary boolean arrays, so my plan for a PR
>> was to have a speedy version of it that uses a single pass through v.
>> (I often wanted to use this function on large datasets.)
>>
>> If anyone has any comments on the idea (good idea? bad idea?) I'd love
>> to hear.
>>
>> I have some further notes and examples here:
>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>
>> Allan
>>
>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>
>>> Hello,
>>>
>>> This is my first e-mail; I will try to make the idea simple.
>>>
>>> Similar to masked arrays, it would be interesting to use a label array
>>> to guide operations.
>>>
>>> Ex.:
>>>
>>> x
>>> labelled_array(data =
>>>     [[0 1 2]
>>>      [3 4 5]
>>>      [6 7 8]],
>>>     label =
>>>     [[0 1 2]
>>>      [0 1 2]
>>>      [0 1 2]])
>>>
>>> sum(x)
>>> array([9, 12, 15])
>>>
>>> The operations would create a new axis for label indexing.
>>>
>>> You could think of it as a collection of masks, one for each label.
>>>
>>> I don't know a way to make something like this efficiently without a
>>> loop. Just wondering...
>>>
>>> Sérgio.
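For reference, a minimal sketch of the bincount / reduceat group-by-sum that
Josef mentions above. The array names (labels, values) and their contents are
illustrative only, and the labels are assumed to be non-negative integers:

    import numpy as np

    labels = np.random.randint(0, 32, size=100000)   # illustrative label array
    values = np.arange(100000, dtype=float)          # illustrative data array

    # bincount: per-label sums, but limited to non-negative integer labels
    sums_bincount = np.bincount(labels, weights=values)

    # reduceat: sort by label once, then reduce over each contiguous group
    perm = labels.argsort()
    sorted_labels = labels[perm]
    boundaries = np.nonzero(sorted_labels[1:] != sorted_labels[:-1])[0] + 1
    starts = np.concatenate(([0], boundaries))       # first index of each group
    sums_reduceat = np.add.reduceat(values[perm], starts)

    # both give the per-label sums, in sorted order of the labels that occur
    assert np.allclose(sums_bincount[np.unique(labels)], sums_reduceat)

The reduceat form also generalizes to reductions other than sum (e.g.
np.maximum.reduceat), which bincount cannot do.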
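Likewise, for the full-size label array in Sérgio's original example, a
label-wise sum can already be written without a Python loop over the labels by
flattening both arrays. A sketch reusing the numbers from the original post:

    import numpy as np

    data = np.arange(9).reshape(3, 3)        # [[0 1 2], [3 4 5], [6 7 8]]
    label = np.tile(np.arange(3), (3, 1))    # [[0 1 2], [0 1 2], [0 1 2]]

    # per-label sum over the whole array, no loop over the labels
    np.bincount(label.ravel(), weights=data.ravel())
    # -> array([  9.,  12.,  15.])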
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion