Conrad,
On 11/22/2012 04:35 AM, Conrad Lee wrote:
We should think about whether we want to pass this rather strange
behavior on to the NearestNeighbor interface. Remember,
NearestNeighbor supports different algorithms (it might for example
use brute force to look up the neighbors within a radius). So if I
use the "radius_neighbors" method of NearestNeighbors using the
"brute" algorithm, the return type is not in the "object" dtype, but
if I use the "ball_tree" algorithm, it is.
I think you're absolutely correct that we should strive for consistency,
and the inconsistency you mention should be addressed: the question is
how to best address it. Here's what I see:
In [1]: import numpy as np
In [2]: from sklearn.neighbors import NearestNeighbors
In [3]: X = np.random.random((100, 3))
In [4]: nbrs = NearestNeighbors(algorithm='brute').fit(X)
In [5]: nbrs.radius_neighbors(X, return_distance=False).dtype
Out[5]: dtype('O')
In [6]: nbrs.radius_neighbors(X[:1], return_distance=False).dtype
Out[6]: dtype('int32')
In [7]: nbrs = NearestNeighbors(algorithm='ball_tree').fit(X)
In [8]: nbrs.radius_neighbors(X, return_distance=False).dtype
Out[8]: dtype('O')
In [9]: nbrs.radius_neighbors(X[:1], return_distance=False).dtype
Out[9]: dtype('O')
The inconsistency is in the case of the "brute" method fit on a single
point. We could change the ball tree result to match, but if we want to
maintain a consistent interface, we should instead return dtype=object
in all cases. My opinion is that overall, a consistent interface is
less confusing than maintaining special cases, but I'm happy to go with
special behavior in special cases if that's what the community feels is
best.
This is all unnecessarily confusing for the user of NearestNeighbor,
especially becaue the documentation of "radius_neighbors" suggests
that you're only looking up the neighbors of a single point. Thus the
"special case" you mentioned above (looking up the radius neighbors of
a single point), is the suggested default case according to the
documentation, and behaves oddly.
The documentation of radius_neighbors says that X should have "last
dimension the same as that of fit data", which to me implies a
collection of points. The description below that, "the new point",
should probably be changed to plural.
Maybe the best thing to do would be to only support looking up the
radius neighbors of one point at a time. Then we don't have to worry
about the fact that different points will have different numbers of
neighbors within a give radius.
I disagree. Querying multiple points one-by-one is much slower than
querying them as a batch, especially for a very large number of points.
We should not prevent batch queries.
One possible compromise would be to return in all cases a list or tuple
of arrays rather than an array of arrays. There are several downsides
to this, but perhaps it would be less confusing to the user. I would
prefer to keep the current behavior for the sake of efficiency and
backward compatibility, but again I'd be happy to bow on this to the
consensus of the community.
Jake
On Wed, Nov 21, 2012 at 5:49 PM, Lars Buitinck <[email protected]
<mailto:[email protected]>> wrote:
2012/11/21 Conrad Lee <[email protected]
<mailto:[email protected]>>:
> The strange thing is that idxs is of dtype "object". I thus
can't use it in
> the way I'd normally use it if it were an integer array. I can't do
> idxs.ravel() to get a flat list of the indices. idxs.shape
returns (1,),
> which is awkward. If I run `idxs.astype("u8") I get an error
(ValueError:
> setting an array element with a sequence.).
>
> Is this behavior intended? Can it be improved? How can I
convert the array
> of dtype "object" into an integer array?
I would say this is a bug. dtype=object is annoying and unnecessary
when the array contains integers.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
------------------------------------------------------------------------------
Monitor your physical, virtual and cloud infrastructure from a single
web console. Get in-depth insight into apps, servers, databases,
vmware,
SAP, cloud infrastructure, etc. Download 30-day Free Trial.
Pricing starts from $795 for 25 servers or applications!
http://p.sf.net/sfu/zoho_dev2dev_nov
_______________________________________________
Scikit-learn-general mailing list
[email protected]
<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Monitor your physical, virtual and cloud infrastructure from a single
web console. Get in-depth insight into apps, servers, databases, vmware,
SAP, cloud infrastructure, etc. Download 30-day Free Trial.
Pricing starts from $795 for 25 servers or applications!
http://p.sf.net/sfu/zoho_dev2dev_nov
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Monitor your physical, virtual and cloud infrastructure from a single
web console. Get in-depth insight into apps, servers, databases, vmware,
SAP, cloud infrastructure, etc. Download 30-day Free Trial.
Pricing starts from $795 for 25 servers or applications!
http://p.sf.net/sfu/zoho_dev2dev_nov
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general