Re: [Scikit-learn-general] Ball tree query_radius interface

Jake Vanderplas Thu, 22 Nov 2012 08:03:43 -0800

Conrad,

On 11/22/2012 04:35 AM, Conrad Lee wrote:

We should think about whether we want to pass this rather strange behavior on to the NearestNeighbors interface. Remember, NearestNeighbors supports different algorithms (it might, for example, use brute force to look up the neighbors within a radius). So if I use the "radius_neighbors" method of NearestNeighbors with the "brute" algorithm, the return type is not of the "object" dtype, but if I use the "ball_tree" algorithm, it is.

I think you're absolutely correct that we should strive for consistency, and the inconsistency you mention should be addressed: the question is how best to address it. Here's what I see:


In [1]: import numpy as np

In [2]: from sklearn.neighbors import NearestNeighbors

In [3]: X = np.random.random((100, 3))

In [4]: nbrs = NearestNeighbors(algorithm='brute').fit(X)

In [5]: nbrs.radius_neighbors(X, return_distance=False).dtype
Out[5]: dtype('O')

In [6]: nbrs.radius_neighbors(X[:1], return_distance=False).dtype
Out[6]: dtype('int32')

In [7]: nbrs = NearestNeighbors(algorithm='ball_tree').fit(X)

In [8]: nbrs.radius_neighbors(X, return_distance=False).dtype
Out[8]: dtype('O')

In [9]: nbrs.radius_neighbors(X[:1], return_distance=False).dtype
Out[9]: dtype('O')

The inconsistency is in the case of the "brute" method queried with a single point (Out[6]). We could change the ball tree result to match, but if we want to maintain a consistent interface, we should instead return dtype=object in all cases. My opinion is that overall, a consistent interface is less confusing than maintaining special cases, but I'm happy to go with special behavior in special cases if that's what the community feels is best.
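
As a rough illustration of what "dtype=object in all cases" could look like, here is a minimal NumPy-only sketch. The helper name pack_neighbors is hypothetical and is not part of scikit-learn. It assumes the per-query neighbor indices are already available as plain sequences, and it fills a pre-allocated object array so that a single query point still yields an object array of shape (1,).

```python
import numpy as np

def pack_neighbors(index_lists):
    """Pack per-query neighbor indices into a 1-D object array.

    Hypothetical helper (not a scikit-learn function). Allocating the
    object array first and filling it element by element keeps NumPy from
    collapsing equal-length rows into a regular 2-D integer array.
    """
    out = np.empty(len(index_lists), dtype=object)
    for i, idx in enumerate(index_lists):
        out[i] = np.asarray(idx, dtype=np.intp)
    return out

# One query point and three query points give the same kind of container.
single = pack_neighbors([[0, 4, 7]])
batch = pack_neighbors([[0, 4, 7], [1], [2, 3]])
print(single.dtype, single.shape)
print(batch.dtype, batch.shape)
```

With this construction the single-point case is no longer special: both results are 1-D object arrays whose elements are integer index arrays.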

This is all unnecessarily confusing for the user of NearestNeighbors, especially because the documentation of "radius_neighbors" suggests that you're only looking up the neighbors of a single point. Thus the "special case" you mentioned above (looking up the radius neighbors of a single point) is the suggested default case according to the documentation, and it behaves oddly.

The documentation of radius_neighbors says that X should have "last dimension the same as that of fit data", which to me implies a collection of points. The description below that, "the new point", should probably be changed to plural.

Maybe the best thing to do would be to only support looking up the radius neighbors of one point at a time. Then we don't have to worry about the fact that different points will have different numbers of neighbors within a given radius.

I disagree. Querying multiple points one-by-one is much slower than querying them as a batch, especially for a very large number of points. We should not prevent batch queries.
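
To make the two calling patterns concrete, here is a small NumPy-only sketch. The functions radius_query_batch and radius_query_one_by_one are hypothetical stand-ins for a brute-force radius search, not the scikit-learn implementation. The sketch does not measure the speed difference; it only shows that a batch query and a per-point loop return the same neighbors, with the loop paying the call overhead once per point.

```python
import numpy as np

def radius_query_batch(X_fit, X_query, radius):
    """Brute-force radius query for all rows of X_query at once."""
    # Squared distances between every query point and every fitted point.
    diff = X_query[:, None, :] - X_fit[None, :, :]
    within = (diff ** 2).sum(axis=-1) <= radius ** 2
    # Rows have different numbers of neighbors, so use an object array.
    out = np.empty(len(X_query), dtype=object)
    for i, row in enumerate(within):
        out[i] = np.flatnonzero(row)
    return out

def radius_query_one_by_one(X_fit, X_query, radius):
    """Same query, issued as one call per query point."""
    return [radius_query_batch(X_fit, q[None, :], radius)[0] for q in X_query]

rng = np.random.default_rng(0)
X = rng.random((100, 3))
batch = radius_query_batch(X, X, radius=0.3)
looped = radius_query_one_by_one(X, X, radius=0.3)

# Both patterns should agree on every query point.
same = all(np.array_equal(a, b) for a, b in zip(batch, looped))
print(same)
```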

One possible compromise would be to return in all cases a list or tuple of arrays rather than an array of arrays. There are several downsides to this, but perhaps it would be less confusing to the user. I would prefer to keep the current behavior for the sake of efficiency and backward compatibility, but again I'd be happy to bow to the consensus of the community on this.
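
For anyone who has an array of arrays in hand and wants a list of arrays or a flat integer array, the following sketch shows one way to convert it. It uses only NumPy calls; the idxs array is built by hand here as an assumed stand-in for a radius query result.

```python
import numpy as np

# Hand-built stand-in for a radius query result: one index array per query.
idxs = np.empty(3, dtype=object)
idxs[0] = np.array([0, 4, 7])
idxs[1] = np.array([1])
idxs[2] = np.array([2, 3])

# An object array of arrays converts to a plain list of arrays directly.
as_list = list(idxs)

# Flat integer array of all neighbor indices, in query order.
flat = np.concatenate(as_list)

# Number of neighbors per query, and the query each flat entry belongs to.
counts = np.array([len(a) for a in idxs])
owner = np.repeat(np.arange(len(idxs)), counts)

print(flat)
print(owner)
```

The flat and owner arrays together carry the same information as the ragged result, which is often easier to work with than calling astype on the object array.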


   Jake

On Wed, Nov 21, 2012 at 5:49 PM, Lars Buitinck <[email protected]> wrote:


    2012/11/21 Conrad Lee <[email protected]>:
    > The strange thing is that idxs is of dtype "object".  I thus can't use it in
    > the way I'd normally use it if it were an integer array.  I can't do
    > idxs.ravel() to get a flat list of the indices.  idxs.shape returns (1,),
    > which is awkward.  If I run `idxs.astype("u8")` I get an error (ValueError:
    > setting an array element with a sequence.).
    >
    > Is this behavior intended?  Can it be improved?  How can I convert the array
    > of dtype "object" into an integer array?

    I would say this is a bug. dtype=object is annoying and unnecessary
    when the array contains integers.

    --
    Lars Buitinck
    Scientific programmer, ILPS
    University of Amsterdam

    
------------------------------------------------------------------------------
    Monitor your physical, virtual and cloud infrastructure from a single
    web console. Get in-depth insight into apps, servers, databases,
    vmware,
    SAP, cloud infrastructure, etc. Download 30-day Free Trial.
    Pricing starts from $795 for 25 servers or applications!
    http://p.sf.net/sfu/zoho_dev2dev_nov
    _______________________________________________
    Scikit-learn-general mailing list
    [email protected]
    <mailto:[email protected]>
    https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




