Conrad,

On 11/22/2012 04:35 AM, Conrad Lee wrote:
We should think about whether we want to pass this rather strange behavior on to the NearestNeighbor interface. Remember, NearestNeighbor supports different algorithms (it might for example use brute force to look up the neighbors within a radius). So if I use the "radius_neighbors" method of NearestNeighbors using the "brute" algorithm, the return type is not in the "object" dtype, but if I use the "ball_tree" algorithm, it is.
I think you're absolutely correct that we should strive for consistency, and the inconsistency you mention should be addressed: the question is how to best address it. Here's what I see:

In [1]: import numpy as np

In [2]: from sklearn.neighbors import NearestNeighbors

In [3]: X = np.random.random((100, 3))

In [4]: nbrs = NearestNeighbors(algorithm='brute').fit(X)

In [5]: nbrs.radius_neighbors(X, return_distance=False).dtype
Out[5]: dtype('O')

In [6]: nbrs.radius_neighbors(X[:1], return_distance=False).dtype
Out[6]: dtype('int32')

In [7]: nbrs = NearestNeighbors(algorithm='ball_tree').fit(X)

In [8]: nbrs.radius_neighbors(X, return_distance=False).dtype
Out[8]: dtype('O')

In [9]: nbrs.radius_neighbors(X[:1], return_distance=False).dtype
Out[9]: dtype('O')

The inconsistency is in the case of the "brute" method fit on a single point. We could change the ball tree result to match, but if we want to maintain a consistent interface, we should instead return dtype=object in all cases. My opinion is that overall, a consistent interface is less confusing than maintaining special cases, but I'm happy to go with special behavior in special cases if that's what the community feels is best.

This is all unnecessarily confusing for the user of NearestNeighbor, especially becaue the documentation of "radius_neighbors" suggests that you're only looking up the neighbors of a single point. Thus the "special case" you mentioned above (looking up the radius neighbors of a single point), is the suggested default case according to the documentation, and behaves oddly.
The documentation of radius_neighbors says that X should have "last dimension the same as that of fit data", which to me implies a collection of points. The description below that, "the new point", should probably be changed to plural.

Maybe the best thing to do would be to only support looking up the radius neighbors of one point at a time. Then we don't have to worry about the fact that different points will have different numbers of neighbors within a give radius.

I disagree. Querying multiple points one-by-one is much slower than querying them as a batch, especially for a very large number of points. We should not prevent batch queries.

One possible compromise would be to return in all cases a list or tuple of arrays rather than an array of arrays. There are several downsides to this, but perhaps it would be less confusing to the user. I would prefer to keep the current behavior for the sake of efficiency and backward compatibility, but again I'd be happy to bow on this to the consensus of the community.

   Jake

On Wed, Nov 21, 2012 at 5:49 PM, Lars Buitinck <[email protected] <mailto:[email protected]>> wrote:

    2012/11/21 Conrad Lee <[email protected]
    <mailto:[email protected]>>:
    > The strange thing is that idxs is of dtype "object".  I thus
    can't use it in
    > the way I'd normally use it if it were an integer array.  I can't do
    > idxs.ravel() to get a flat list of the indices.  idxs.shape
    returns (1,),
    > which is awkward.  If I run `idxs.astype("u8") I get an error
    (ValueError:
    > setting an array element with a sequence.).
    >
    > Is this behavior intended?  Can it be improved?  How can I
    convert the array
    > of dtype "object" into an integer array?

    I would say this is a bug. dtype=object is annoying and unnecessary
    when the array contains integers.

    --
    Lars Buitinck
    Scientific programmer, ILPS
    University of Amsterdam

    
------------------------------------------------------------------------------
    Monitor your physical, virtual and cloud infrastructure from a single
    web console. Get in-depth insight into apps, servers, databases,
    vmware,
    SAP, cloud infrastructure, etc. Download 30-day Free Trial.
    Pricing starts from $795 for 25 servers or applications!
    http://p.sf.net/sfu/zoho_dev2dev_nov
    _______________________________________________
    Scikit-learn-general mailing list
    [email protected]
    <mailto:[email protected]>
    https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




------------------------------------------------------------------------------
Monitor your physical, virtual and cloud infrastructure from a single
web console. Get in-depth insight into apps, servers, databases, vmware,
SAP, cloud infrastructure, etc. Download 30-day Free Trial.
Pricing starts from $795 for 25 servers or applications!
http://p.sf.net/sfu/zoho_dev2dev_nov


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


------------------------------------------------------------------------------
Monitor your physical, virtual and cloud infrastructure from a single
web console. Get in-depth insight into apps, servers, databases, vmware,
SAP, cloud infrastructure, etc. Download 30-day Free Trial.
Pricing starts from $795 for 25 servers or applications!
http://p.sf.net/sfu/zoho_dev2dev_nov
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to