Sorry – I should have specified more clearly what I meant.

A user-defined metric passed as a Python function will always be slow, no
matter how you write that function.

If you want to create a custom metric that works quickly within BallTree,
you'll have to write it in Cython *within* the scikit-learn source: that
is, you must modify the actual .pyx and .pxd files in sklearn/neighbors/,
write the new metric in Cython, make it accessible via an appropriate
string argument to the BinaryTree class, and then re-compile the entire
scikit-learn package.

The procedure is not well-documented, because this is not an anticipated
use of the code and I wouldn't necessarily recommend it. That said, if
you're very familiar with Cython you should be able to proceed by
using one of the other implemented metrics as a template.
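
For orientation, here is a minimal sketch of the end state once such a
metric is compiled in: it becomes available by its string name, with no
Python callback in the loop. The name 'mymetric' below is hypothetical;
you can list the strings your build actually accepts via
BallTree.valid_metrics.

import numpy as np
from sklearn.neighbors import BallTree

# metric strings accepted by the compiled tree code in this build
print(BallTree.valid_metrics)

X = np.random.random((1000, 20))

# hypothetical end state after adding a Cython metric and re-compiling:
# tree = BallTree(X, leaf_size=3, metric='mymetric')
tree = BallTree(X, leaf_size=3, metric='euclidean')  # built-in fast path
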
   Jake

 Jake VanderPlas
 Director of Research – Physical Sciences
 eScience Institute, University of Washington
 http://www.vanderplas.com

On Thu, May 14, 2015 at 4:10 PM, nafise mehdipoor <mehdipour...@yahoo.com>
wrote:

> I just tried compiling my metric with Cython, and it is still far from
> what I need (around 60 seconds)!
> For the same code as before:
>
> using 'Euclidean' metric: 30 sec
> metric as a function: 43 min
> metric as a python module: 27 min
> Do you have any other solution that I can try? I should repeat that I
> still haven't applied my real metric; I am just comparing the Euclidean
> metric with an identical user-defined one.
>
>
>
>
>   On Thursday, May 14, 2015 11:24 PM, nafise mehdipoor <
> mehdipour...@yahoo.com> wrote:
>
>
> Thank you so much. That was just an example; the real metric will be
> defined so that it meets the conditions of a valid ball tree metric.
>
> All the best.
>
>
>
>   On Thursday, May 14, 2015 5:47 PM, Jacob Vanderplas <
> jake...@cs.washington.edu> wrote:
>
>
> User-defined metrics will always be slow, because they rely on the Python
> layer for callbacks. The only way to improve it is to write your own metric
> in Cython and re-compile the ball tree/kd tree code with your new addition.
> Furthermore, you have to make sure your metric actually meets the
> definition of a true metric, or the tree code will not produce the right
> results.
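>
> For example, a quick randomized spot-check of the triangle inequality
> might look like this (a sketch, not a proof; passing it is necessary but
> not sufficient for a true metric):
>
> import numpy as np
>
> def metric(x, y):
>     # stand-in for your metric; replace with the real definition
>     return np.sqrt(np.sum((x - y) ** 2))
>
> rng = np.random.RandomState(0)
> X = rng.random_sample((300, 20))
>
> # check d(a, c) <= d(a, b) + d(b, c) on random triples
> for _ in range(10000):
>     a, b, c = X[rng.randint(len(X), size=3)]
>     assert metric(a, c) <= metric(a, b) + metric(b, c) + 1e-12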
>    Jake
>
>  Jake VanderPlas
>  Director of Research – Physical Sciences
>  eScience Institute, University of Washington
>  http://www.vanderplas.com
>
> On Thu, May 14, 2015 at 8:31 AM, nafise mehdipoor <mehdipour...@yahoo.com>
> wrote:
>
> Thank you all for the explanation.
> Here I am trying to implement a ball-tree-based DBSCAN algorithm, using
> the BallTree class. Since I mean to use a user-defined metric, I tried it
> with the code below. The time for this case is awful.
> Do you have any advice to improve it?
> Thank you.
>
> Total time:  0:16:58.820000
> -------------------------------------------------------
> import datetime
>
> import numpy as np
> from sklearn.neighbors import BallTree
>
> D = 20
> N = 40000
>
> def myDistance(x, y):
>     # plain-Python Euclidean distance; called back once per pair of points
>     return np.sqrt(np.sum((x - y) ** 2))
>
> beginTime = datetime.datetime.now()
>
> dataSet = np.random.random((N, D))
> point = np.random.random((1, D))
> balltree = BallTree(dataSet, leaf_size=3,
>                     metric='pyfunc', func=myDistance)
> for i in range(40000):
>     ind = balltree.query_radius(point, r=0.2)
> time = datetime.datetime.now() - beginTime
>
> print('Total time:', time)
>
>
>   On Thursday, May 14, 2015 4:05 PM, Michael Eickenberg <
> michael.eickenb...@gmail.com> wrote:
>
>
> Using EuclideanDistance: this can actually explain a factor of 10
> (probably dimensionality-dependent).
>
>
> In [56]: d = sklearn.neighbors.dist_metrics.EuclideanDistance()
>
> In [57]: %timeit d.pairwise(x, y)
> 100 loops, best of 3: 6.04 ms per loop
>
> In [58]: %timeit d.pairwise(x, y)
> 100 loops, best of 3: 5.93 ms per loop
>
> In [59]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2)
>
> In [60]: %timeit d.pairwise(x, y)
> 10 loops, best of 3: 66.9 ms per loop
>
> In [61]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(3)
>
> In [62]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 377 ms per loop
>
>
>
>
>
> On Thu, May 14, 2015 at 4:00 PM, Michael Eickenberg <
> michael.eickenb...@gmail.com> wrote:
>
> Oops, I overlooked that the comparison was against EuclideanDistance
> proper.
> But as Jake says, some optimization does happen in the L2 case, and in
> the example I gave it accounts for a factor of 5-6.
>
> On Thu, May 14, 2015 at 3:44 PM, Michael Eickenberg <
> michael.eickenb...@gmail.com> wrote:
>
> There may also be some implicit optimization going on. Without looking at
> the (non-pure-Python) source of the Minkowski metric, a few tests hint
> that p=2 and p=1 are probably treated differently from the rest, because
> mathematical tricks can be applied that avoid calculating the full
> distance vector (see the sketch below). So maybe some of the speed
> difference is already explained by that, but I am not sure how much that
> matters here, since I am unfamiliar with the internals of BallTree and
> how many of these evaluations it actually does.
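>
> To illustrate the kind of trick I mean (a guess at the mechanism, not a
> reading of the source): for p=2, a comparison against a radius can be
> made on the squared distance, skipping the square root entirely.
>
> import numpy as np
>
> x = np.random.randn(200)
> y = np.random.randn(200)
> r = 5.0
>
> # the full p=2 distance needs a square root...
> d = np.sqrt(np.sum((x - y) ** 2))
>
> # ...but thresholding on the monotone "reduced distance" (squared
> # distance vs. squared radius) gives the same answer more cheaply
> assert (d <= r) == (np.sum((x - y) ** 2) <= r ** 2)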
>
> Michael
>
>
> In [38]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2)
>
> In [39]: x = np.random.randn(100, 200)
>
> In [40]: y = np.random.randn(200, 200)
>
> In [41]: %timeit d.pairwise(x, y)
> 10 loops, best of 3: 66.8 ms per loop
>
> In [42]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2.1)
>
> In [43]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 352 ms per loop
>
> In [44]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.9)
>
> In [45]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 353 ms per loop
>
> In [46]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1)
>
> In [47]: %timeit d.pairwise(x, y)
> 10 loops, best of 3: 60.4 ms per loop
>
> In [48]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.1)
>
> In [49]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 357 ms per loop
>
> In [50]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(10)
>
> In [51]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 386 ms per loop
>
> In [52]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(100)
>
> In [53]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 403 ms per loop
>
> In [54]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1000)
>
> In [55]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 472 ms per loop
>
>
>
>
>
>
> On Thu, May 14, 2015 at 3:14 PM, Jacob Vanderplas <
> jake...@cs.washington.edu> wrote:
>
> Hi Nafise,
> In general, yes, the execution time will depend on the metric, and in a
> way that can be very difficult to predict.
>
> One reason is that, because the euclidean metric is the most common, that
> case is slightly optimized. But that won't account for a factor of ten.
>
> The Ball Tree works by dividing the metric space into spherical nodes, and
> depending on which metric you use, there can be more or less overlap
> between the resulting nodes. As you increase the p value of a minkowski
> metric, there will generally be more overlap between nodes because in some
> sense the amount of space covered by a ball of radius R increases. Any node
> overlap means that you're less likely to throw away branches when querying
> the tree, meaning that more distance computations need to be done. All of
> this is highly dependent on the precise dimension and structure of the data
> you're querying: e.g. if you have uniformly-distributed data in high
> dimensions, there will be a larger slowdown than if you have
> highly-structured low-dimensional data.
>
> Without knowing more details about the data you're looking at, I think
> that's probably what's going on. Note that if you're interested in fast
> queries for a minkowski metric in particular, I'd try the KD Tree: the
> node overlap issue shouldn't come up there.
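>
> A quick way to compare the two trees on your own data might look like
> this (a sketch; the relative timings will depend entirely on your data):
>
> import time
> import numpy as np
> from sklearn.neighbors import BallTree, KDTree
>
> X = np.random.random((40000, 20))
> point = np.random.random((1, 20))
>
> # KD trees use axis-aligned splits, so the node-overlap issue described
> # above for minkowski metrics should not arise
> for Tree in (BallTree, KDTree):
>     t0 = time.time()
>     tree = Tree(X, metric='minkowski', p=3)
>     ind = tree.query_radius(point, r=0.2)
>     print(Tree.__name__, time.time() - t0)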
>    Jake
>
>  Jake VanderPlas
>  Director of Research – Physical Sciences
>  eScience Institute, University of Washington
>  http://www.vanderplas.com
>
> On Thu, May 14, 2015 at 12:48 AM, nafise mehdipoor <mehdipour...@yahoo.com
> > wrote:
>
> Hi everyone,
> Does the complexity of the ball tree implementation depend on the metric?
> When I run my code to cluster my points using the 'minkowski' metric
> instead of 'euclidean', the time jumps from 0.8 s to 8 s!
> Thank you.