Thank you all for the explanations. I am implementing a BallTree-based DBSCAN
and I am using the BallTree class. Since I need a user-defined metric, I tried
it with the code below, but the running time is awful. Do you have any advice
on how to improve it? Thank you.
Total time: 0:16:58.820000
-------------------------------------------------------
import datetime
import numpy as np
from sklearn.neighbors import BallTree

D = 20
N = 40000

def myDistance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

beginTime = datetime.datetime.now()
dataSet = np.random.random((N, D))
point = np.random.random((1, D))
balltree = BallTree(dataSet, leaf_size=3, metric='pyfunc', func=myDistance)
for i in xrange(0, 40000):
    ind = balltree.query_radius(point, r=0.2)
time = datetime.datetime.now() - beginTime
print 'Total time: ', time
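For comparison, a minimal sketch of the same benchmark using the built-in
'euclidean' metric (same data and parameters assumed), which keeps the distance
computations in compiled code instead of going through a Python callback for
every pair:

import datetime
import numpy as np
from sklearn.neighbors import BallTree

D, N = 20, 40000
dataSet = np.random.random((N, D))
point = np.random.random((1, D))

beginTime = datetime.datetime.now()
# built-in metric: no per-pair Python call
balltree = BallTree(dataSet, leaf_size=3, metric='euclidean')
for i in xrange(0, 40000):
    ind = balltree.query_radius(point, r=0.2)
print 'Total time: ', datetime.datetime.now() - beginTime

If you need the neighborhood of every point (as DBSCAN does), query_radius also
accepts the whole array in one call, ind = balltree.query_radius(dataSet, r=0.2),
which keeps that loop out of Python as well.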
On Thursday, May 14, 2015 4:05 PM, Michael Eickenberg
<michael.eickenb...@gmail.com> wrote:
Using EuclideanDistance: this can actually explain a factor of 10 (probably
dimensionality-dependent)
In [56]: d = sklearn.neighbors.dist_metrics.EuclideanDistance()
In [57]: %timeit d.pairwise(x, y)
100 loops, best of 3: 6.04 ms per loop
In [58]: %timeit d.pairwise(x, y)
100 loops, best of 3: 5.93 ms per loop
In [59]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2)
In [60]: %timeit d.pairwise(x, y)
10 loops, best of 3: 66.9 ms per loop
In [61]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(3)
In [62]: %timeit d.pairwise(x, y)
1 loops, best of 3: 377 ms per loop
On Thu, May 14, 2015 at 4:00 PM, Michael Eickenberg
<michael.eickenb...@gmail.com> wrote:
Oops, I overlooked that the comparison was against EuclideanDistance proper.
But as Jake says, some optimization does happen in the l2 case and in the
example I gave it explains a factor of 5-6.
On Thu, May 14, 2015 at 3:44 PM, Michael Eickenberg
<michael.eickenb...@gmail.com> wrote:
There may also be some implicit optimization going on. Without having looked at
the (non pure-Python) source of the Minkowski metric, a few tests hint that p=2
and p=1 are probably treated differently from the rest, because some
mathematical tricks can be applied that avoid calculating the full distance
vector. So maybe some of the speed difference is already explained by that, but
I am not sure which effect is most prevalent here, since I am unfamiliar with
the internals of BallTree and how many of these evaluations it actually does.
Michael
In [38]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2)
In [39]: x = np.random.randn(100, 200)
In [40]: y = np.random.randn(200, 200)
In [41]: %timeit d.pairwise(x, y)
10 loops, best of 3: 66.8 ms per loop
In [42]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2.1)
In [43]: %timeit d.pairwise(x, y)
1 loops, best of 3: 352 ms per loop
In [44]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.9)
In [45]: %timeit d.pairwise(x, y)
1 loops, best of 3: 353 ms per loop
In [46]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1)
In [47]: %timeit d.pairwise(x, y)
10 loops, best of 3: 60.4 ms per loop
In [48]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.1)
In [49]: %timeit d.pairwise(x, y)
1 loops, best of 3: 357 ms per loop
In [50]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(10)
In [51]: %timeit d.pairwise(x, y)
1 loops, best of 3: 386 ms per loop
In [52]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(100)
In [53]: %timeit d.pairwise(x, y)
1 loops, best of 3: 403 ms per loop
In [54]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1000)
In [55]: %timeit d.pairwise(x, y)
1 loops, best of 3: 472 ms per loop
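As a rough illustration (plain NumPy, not sklearn's actual implementation) of
the kind of shortcut meant above: for p=2 the squared distance is a monotonic
surrogate for the true distance, so a radius test can skip the final square
root and the pairwise squared distances can be expanded into dot products,
whereas a generic Minkowski p needs the full element-wise |x - y|**p vector for
every pair.

import numpy as np

x = np.random.randn(100, 200)
y = np.random.randn(200, 200)

# p = 2: expand ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b and compare squared
# distances against r**2, so no per-pair power or root is ever taken
sq = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * x.dot(y.T)
within = sq <= 0.2 ** 2

# generic Minkowski p: the full |a - b|**p vector is needed for every pair
p = 2.1
d_p = (np.abs(x[:, None, :] - y[None, :, :]) ** p).sum(axis=-1) ** (1.0 / p)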
On Thu, May 14, 2015 at 3:14 PM, Jacob Vanderplas <jake...@cs.washington.edu>
wrote:
Hi Nafise,
In general, yes, the execution time will depend on the metric, and in a way
that can be very difficult to predict.
One reason is that, because the euclidean metric is the most common, that case
is slightly optimized. But that alone won't account for a factor of ten.
The Ball Tree works by dividing the metric space into spherical nodes, and
depending on which metric you use, there can be more or less overlap between
the resulting nodes. As you increase the p value of a Minkowski metric, there
will generally be more overlap between nodes, because in some sense the amount
of space covered by a ball of radius R increases. Any node overlap means you're
less likely to throw away branches when querying the tree, so more distance
computations need to be done. All of this is highly dependent on
the precise dimension and structure of the data you're querying: e.g. if you
have uniformly-distributed data in high dimensions, there will be a larger
slowdown than if you have highly-structured low-dimensional data.
Without knowing more details about the data you're looking at, I think that's
probably what's going on. Note that if you're interested in fast queries for a
Minkowski metric in particular, I'd try the KD Tree: the node overlap issue
shouldn't come up there.
Jake
Jake VanderPlas
Director of Research – Physical Sciences
eScience Institute, University of Washington
http://www.vanderplas.com
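A minimal sketch of that suggestion (illustrative data and parameters only, not
a claim about the original problem): build both trees on the same uniform data
with a Minkowski metric and run the same radius query, so the two can be timed
against each other.

import numpy as np
from sklearn.neighbors import BallTree, KDTree

X = np.random.random((40000, 20))
query = np.random.random((1, 20))

# same metric for both trees; KDTree partitions space into disjoint
# axis-aligned boxes, so its nodes never overlap, while BallTree nodes can
bt = BallTree(X, leaf_size=30, metric='minkowski', p=3)
kt = KDTree(X, leaf_size=30, metric='minkowski', p=3)

ind_bt = bt.query_radius(query, r=0.2)
ind_kt = kt.query_radius(query, r=0.2)
# both return the same neighbor indices; the interesting difference is how
# many distance evaluations each tree needs to find them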
On Thu, May 14, 2015 at 12:48 AM, nafise mehdipoor <mehdipour...@yahoo.com>
wrote:
Hi everyone,
Does the complexity of the BallTree implementation depend on the metric? When I
run my code to cluster my points and use the 'minkowski' metric instead of
'euclidean', the time jumps from 0.8 s to 8 s!!
Thank you.
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general