Just something I tried with pandas:

>>> image
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]],

       [[40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])
>>> label
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7]])
>>> dt = pd.DataFrame(np.vstack((label.ravel(), image.reshape(3, 20))).T)
>>> labelled_image = dt.groupby(0)
>>> labelled_image.mean().values
array([[ 0, 20, 40],
       [ 3, 23, 43],
       [ 6, 26, 46],
       [ 9, 29, 49],
       [10, 30, 50],
       [13, 33, 53],
       [16, 36, 56],
       [19, 39, 59]])

Sergio

> Date: Sat, 13 Feb 2016 22:41:13 -0500
> From: Allan Haldane <allanhald...@gmail.com>
> To: numpy-discussion@scipy.org
> Subject: Re: [Numpy-discussion] [Suggestion] Labelled Array
> Message-ID: <56bff759.7010...@gmail.com>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
> Impressive!
>
> Possibly there's still a case for including a 'groupby' function in
> numpy itself, since it's a generally useful operation, but I do see less
> of a need given the nice pandas functionality.
>
> At least, next time someone asks a stackoverflow question like the ones
> below, someone should tell them to use pandas!
>
> (copied from my gist for future list reference.)
> http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
> http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-values-in-the-array-a-condition/31484134#31484134
> http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-on-values-in-the-array
> http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smaller-arrays-in-python
> http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-to-a-condition-in-numpy
>
> Allan
>
> On 02/13/2016 01:39 PM, Jeff Reback wrote:
> > In [10]: pd.options.display.max_rows=10
> >
> > In [13]: np.random.seed(1234)
> >
> > In [14]: c = np.random.randint(0,32,size=100000)
> >
> > In [15]: v = np.arange(100000)
> >
> > In [16]: df = DataFrame({'v' : v, 'c' : c})
> >
> > In [17]: df
> > Out[17]:
> >         c      v
> > 0      15      0
> > 1      19      1
> > 2       6      2
> > 3      21      3
> > 4      12      4
> > ...    ..    ...
> > 99995   7  99995
> > 99996   2  99996
> > 99997  27  99997
> > 99998  28  99998
> > 99999   7  99999
> >
> > [100000 rows x 2 columns]
> >
> > In [19]: df.groupby('c').count()
> > Out[19]:
> >        v
> > c
> > 0   3136
> > 1   3229
> > 2   3093
> > 3   3121
> > 4   3041
> > ..   ...
> > 27  3128
> > 28  3063
> > 29  3147
> > 30  3073
> > 31  3090
> >
> > [32 rows x 1 columns]
> >
> > In [20]: %timeit df.groupby('c').count()
> > 100 loops, best of 3: 2 ms per loop
> >
> > In [21]: %timeit df.groupby('c').mean()
> > 100 loops, best of 3: 2.39 ms per loop
> >
> > In [22]: df.groupby('c').mean()
> > Out[22]:
> >                v
> > c
> > 0   49883.384885
> > 1   50233.692165
> > 2   48634.116069
> > 3   50811.743992
> > 4   50505.368629
> > ..           ...
> > 27  49715.349425
> > 28  50363.501469
> > 29  50485.395933
> > 30  50190.155223
> > 31  50691.041748
> >
> > [32 rows x 1 columns]
> >
> > On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:
> >
> >     On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane
> >     <allanhald...@gmail.com> wrote:
> >
> >         Sorry, to reply to myself here, but looking at it with fresh
> >         eyes maybe the performance of the naive version isn't too bad.
> >         Here's a comparison of the naive vs a better implementation:
> >
> >             def split_classes_naive(c, v):
> >                 return [v[c == u] for u in unique(c)]
> >
> >             def split_classes(c, v):
> >                 perm = c.argsort()
> >                 csrt = c[perm]
> >                 div = where(csrt[1:] != csrt[:-1])[0] + 1
> >                 return [v[x] for x in split(perm, div)]
> >
> >             >>> c = randint(0,32,size=100000)
> >             >>> v = arange(100000)
> >             >>> %timeit split_classes_naive(c,v)
> >             100 loops, best of 3: 8.4 ms per loop
> >             >>> %timeit split_classes(c,v)
> >             100 loops, best of 3: 4.79 ms per loop
> >
> >     The use cases I recently started to target for similar things are
> >     1 million or more rows and 10000 uniques in the labels. The second
> >     version should be faster for a large number of uniques, I guess.
> >
> >     Overall numpy is falling far behind pandas in terms of simple
> >     groupby operations. bincount and histogram (IIRC) worked for some
> >     cases but are rather limited.
> >
> >     reduce_at looks nice for cases where it applies.
> >
> >     In contrast to the full sized labels in the original post, I only
> >     know of applications where the labels are 1-D corresponding to rows
> >     or columns.
> >
> >     Josef
> >
> >         In any case, maybe it is useful to Sergio or others.
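For what it's worth, here is a self-contained, runnable version of Allan's two split_classes variants quoted above (plain NumPy; the bare names in the quoted snippet are assumed to come from a `from numpy import *`):

```python
import numpy as np

def split_classes_naive(c, v):
    # One boolean mask per unique label: O(n_labels * n) comparisons.
    return [v[c == u] for u in np.unique(c)]

def split_classes(c, v):
    # Sort once, then cut the permutation at the label boundaries.
    perm = c.argsort()
    csrt = c[perm]
    div = np.where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in np.split(perm, div)]

c = np.random.randint(0, 32, size=100000)
v = np.arange(100000)

# Both versions produce the same groups (order within a group may differ,
# since the default argsort is not stable), in ascending label order.
for a, b in zip(split_classes_naive(c, v), split_classes(c, v)):
    assert np.array_equal(np.sort(a), np.sort(b))
```

The argsort version does the label comparisons once on the sorted array instead of once per unique label, which is where the speedup for many uniques comes from.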
> >         Allan
> >
> >         On 02/13/2016 12:11 PM, Allan Haldane wrote:
> >
> >             I've had a pretty similar idea for a new indexing function
> >             'split_classes' which would help in your case, which
> >             essentially does
> >
> >                 def split_classes(c, v):
> >                     return [v[c == u] for u in unique(c)]
> >
> >             Your example could be coded as
> >
> >                 >>> [sum(c) for c in split_classes(label, data)]
> >                 [9, 12, 15]
> >
> >             I feel I've come across the need for such a function often
> >             enough that it might be generally useful to people as part
> >             of numpy. The implementation of split_classes above has
> >             pretty poor performance because it creates many temporary
> >             boolean arrays, so my plan for a PR was to have a speedy
> >             version of it that uses a single pass through v. (I often
> >             wanted to use this function on large datasets.)
> >
> >             If anyone has any comments on the idea (good idea? bad
> >             idea?) I'd love to hear.
> >
> >             I have some further notes and examples here:
> >             https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
> >
> >             Allan
> >
> >             On 02/12/2016 09:40 AM, Sérgio wrote:
> >
> >                 Hello,
> >
> >                 This is my first e-mail; I will try to keep the idea
> >                 simple.
> >
> >                 Similar to a masked array, it would be interesting to
> >                 use a label array to guide operations.
> >
> >                 Ex.:
> >                 >>> x
> >                 labelled_array(data =
> >                 [[0 1 2]
> >                  [3 4 5]
> >                  [6 7 8]],
> >                 label =
> >                 [[0 1 2]
> >                  [0 1 2]
> >                  [0 1 2]])
> >
> >                 >>> sum(x)
> >                 array([9, 12, 15])
> >
> >                 The operations would create a new axis for label
> >                 indexing.
> >
> >                 You could think of it as a collection of masks, one
> >                 for each label.
> >
> >                 I don't know a way to do something like this
> >                 efficiently without a loop. Just wondering...
> >
> >                 Sérgio.
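Incidentally, the specific sum-by-label reduction in the original example can already be done in plain NumPy without an explicit Python loop, using np.bincount with weights. This is only a sketch of that one reduction, not of the proposed labelled_array type:

```python
import numpy as np

data = np.arange(9).reshape(3, 3)      # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
label = np.tile(np.arange(3), (3, 1))  # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]

# bincount sums the weights that fall into each label bin:
# label 0 -> 0+3+6, label 1 -> 1+4+7, label 2 -> 2+5+8.
sums = np.bincount(label.ravel(), weights=data.ravel())
print(sums)  # [ 9. 12. 15.]
```

Note that bincount returns float64 whenever weights are given, and it only covers sum-like reductions over non-negative integer labels, which is part of why a general groupby keeps coming up.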
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion