If you're computing neighbors of very high-dimensional data, you might
consider an approximate search instead, for example one based on Locality
Sensitive Hashing. There are some papers out there on kernelized LSH,
which allows a wide range of kernels, and many metrics can be approximated
with a judiciously chosen kernel.
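
For plain (non-kernelized) LSH, recent scikit-learn versions ship an
experimental LSHForest estimator (approximate and cosine-based, so not a
drop-in replacement for an arbitrary metric); a minimal usage sketch, with
made-up sizes:

import numpy as np
from sklearn.neighbors import LSHForest

X = np.random.random((40000, 20))   # stand-in for your data
query = np.random.random((1, 20))

# Build the approximate index and query it; LSHForest hashes
# random projections, approximating cosine-similarity neighbors.
lshf = LSHForest(n_estimators=20, n_candidates=200, random_state=0)
lshf.fit(X)
distances, indices = lshf.kneighbors(query, n_neighbors=5)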
   Jake

 Jake VanderPlas
 Director of Research – Physical Sciences
 eScience Institute, University of Washington
 http://www.vanderplas.com

On Thu, May 14, 2015 at 4:54 PM, Jacob Vanderplas <jake...@cs.washington.edu
> wrote:

> Sorry – I should have been more specific about what I meant.
>
> A user-defined metric passed as a Python function will always be slow, no
> matter how you define that function.
>
> If you want to create a custom metric that works quickly within BallTree,
> you'll have to write it in Cython *within* the scikit-learn source: that
> is, you must modify the .pyx and .pxd files in sklearn/neighbors/, write a
> new metric in Cython, make it accessible via an appropriate string
> argument to the BinaryTree class, and then re-compile the entire
> scikit-learn package.
>
> The procedure is not well documented, because this is not an anticipated
> use of the code, and I wouldn't necessarily recommend it. That said, if
> you're very familiar with Cython, you should be able to proceed by using
> one of the other implemented metrics as a template.
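>
> For orientation only, here is a heavily simplified sketch of the shape
> such a metric class takes inside sklearn/neighbors/dist_metrics.pyx; treat
> it as pseudocode against the real template, since the exact typedefs and
> signatures vary between versions, and the class name here is made up:
>
> # Sketch only; modeled loosely on the metric classes in
> # sklearn/neighbors/dist_metrics.pyx (DTYPE_t/ITYPE_t are
> # the typedefs already defined in that module).
> from libc.math cimport sqrt
>
> cdef class MyCustomDistance(DistanceMetric):
>     cdef inline DTYPE_t dist(self, DTYPE_t* x1, DTYPE_t* x2,
>                              ITYPE_t size) except -1:
>         cdef DTYPE_t tmp, d = 0
>         cdef ITYPE_t j
>         for j in range(size):
>             tmp = x1[j] - x2[j]
>             d += tmp * tmp
>         return sqrt(d)
>
> After wiring the new class to a string name (the module keeps a
> METRIC_MAPPING dict for this) and rebuilding, the new name should show up
> in BallTree.valid_metrics.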
>    Jake
>
>  Jake VanderPlas
>  Director of Research – Physical Sciences
>  eScience Institute, University of Washington
>  http://www.vanderplas.com
>
> On Thu, May 14, 2015 at 4:10 PM, nafise mehdipoor <mehdipour...@yahoo.com>
> wrote:
>
>> I just tried compiling my metric with Cython, and it is still far from
>> what I need (around 60 seconds)!
>> Still, for the previous code:
>>
>> using the 'euclidean' metric: 30 sec
>> metric as a function: 43 min
>> metric as a Python module: 27 min
>>
>> Do you have any other solution I can try? I should mention again that I
>> still haven't applied my real metric; I am just comparing the built-in
>> euclidean metric with the same metric written as a user-defined function.
>>
>>
>>
>>
>>   On Thursday, May 14, 2015 11:24 PM, nafise mehdipoor <
>> mehdipour...@yahoo.com> wrote:
>>
>>
>> Thank you so much. That was just an example; the real metric will be
>> defined within the conditions of a valid ball-tree metric.
>>
>> Best.
>>
>>
>>
>>   On Thursday, May 14, 2015 5:47 PM, Jacob Vanderplas <
>> jake...@cs.washington.edu> wrote:
>>
>>
>> User-defined metrics will always be slow, because they rely on the Python
>> layer for callbacks. The only way to improve it is to write your own metric
>> in Cython and re-compile the ball tree/kd tree code with your new addition.
>> Furthermore, you have to make sure your metric actually meets the
>> definition of a true metric, or the tree code will not produce the right
>> results.
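>>
>> (As an illustration, not from scikit-learn itself: a quick spot-check of
>> the metric axioms on random points. It is necessary but not sufficient,
>> and check_metric_axioms is a made-up helper name.)
>>
>> import numpy as np
>>
>> def check_metric_axioms(dist, n_trials=1000, dim=20, tol=1e-12):
>>     # Spot-check identity, symmetry and the triangle inequality
>>     # on random triples of points.
>>     rng = np.random.RandomState(0)
>>     for _ in xrange(n_trials):
>>         x, y, z = rng.random_sample((3, dim))
>>         assert dist(x, x) <= tol                            # identity
>>         assert abs(dist(x, y) - dist(y, x)) <= tol          # symmetry
>>         assert dist(x, z) <= dist(x, y) + dist(y, z) + tol  # triangle
>>     return True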
>>    Jake
>>
>>  Jake VanderPlas
>>  Director of Research – Physical Sciences
>>  eScience Institute, University of Washington
>>  http://www.vanderplas.com
>>
>> On Thu, May 14, 2015 at 8:31 AM, nafise mehdipoor <mehdipour...@yahoo.com
>> > wrote:
>>
>> Thank you all for the explanation.
>> I am trying to implement a ball-tree-based DBSCAN algorithm, using the
>> BallTree class. Since I mean to use a user-defined metric, I tried it
>> with the code below. The time for this case is awful. Do you have any
>> advice to improve it?
>> Thank you.
>>
>> Total time:  0:16:58.820000
>> -------------------------------------------------------
>> import datetime
>> import numpy as np
>> from sklearn.neighbors import BallTree
>>
>> D = 20
>> N = 40000
>>
>> def myDistance(x, y):
>>     # plain Euclidean distance, as a Python callback
>>     return np.sqrt(np.sum((x - y) ** 2))
>>
>> beginTime = datetime.datetime.now()
>>
>> dataSet = np.random.random((N, D))
>> point = np.random.random((1, D))
>> balltree = BallTree(dataSet, leaf_size=3,
>>                     metric='pyfunc', func=myDistance)
>> for i in xrange(N):
>>     # repeat the same query to get a measurable total time
>>     ind = balltree.query_radius(point, r=0.2)
>> time = datetime.datetime.now() - beginTime
>>
>> print 'Total time: ', time
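>>
>> (An aside, not part of the original message: if the end goal is DBSCAN
>> itself, scikit-learn's own estimator can be backed by a ball tree, so a
>> hand-rolled DBSCAN may be unnecessary; eps and min_samples below are
>> placeholder values.)
>>
>> from sklearn.cluster import DBSCAN
>>
>> labels = DBSCAN(eps=0.2, min_samples=5,
>>                 algorithm='ball_tree', leaf_size=3,
>>                 metric='euclidean').fit_predict(dataSet)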
>>
>>
>>   On Thursday, May 14, 2015 4:05 PM, Michael Eickenberg <
>> michael.eickenb...@gmail.com> wrote:
>>
>>
>> Using EuclideanDistance rather than MinkowskiDistance(2) can actually
>> explain a factor of ~10 (probably dimensionality-dependent):
>>
>>
>> In [56]: d = sklearn.neighbors.dist_metrics.EuclideanDistance()
>>
>> In [57]: %timeit d.pairwise(x, y)
>> 100 loops, best of 3: 6.04 ms per loop
>>
>> In [58]: %timeit d.pairwise(x, y)
>> 100 loops, best of 3: 5.93 ms per loop
>>
>> In [59]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2)
>>
>> In [60]: %timeit d.pairwise(x, y)
>> 10 loops, best of 3: 66.9 ms per loop
>>
>> In [61]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(3)
>>
>> In [62]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 377 ms per loop
>>
>>
>>
>>
>>
>> On Thu, May 14, 2015 at 4:00 PM, Michael Eickenberg <
>> michael.eickenb...@gmail.com> wrote:
>>
>> Oops, I overlooked that the comparison was against EuclideanDistance
>> proper. But as Jake says, some optimization does happen in the L2 case,
>> and in the example I gave it explains a factor of 5-6.
>>
>> On Thu, May 14, 2015 at 3:44 PM, Michael Eickenberg <
>> michael.eickenb...@gmail.com> wrote:
>>
>> There may also be some implicit optimization going on. Without looking at
>> the (non-pure-Python) source of the Minkowski metric, a few tests hint
>> that p = 1 and p = 2 are probably treated differently from the rest,
>> because some mathematical tricks can be applied that avoid calculating
>> the full distance vector. So maybe some of the speed difference is
>> already explained by that, but I am not sure which effect is the most
>> prevalent here, since I am unfamiliar with the internals of BallTree and
>> how many of these evaluations it actually does.
>>
>> Michael
>>
>>
>> In [38]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2)
>>
>> In [39]: x = np.random.randn(100, 200)
>>
>> In [40]: y = np.random.randn(200, 200)
>>
>> In [41]: %timeit d.pairwise(x, y)
>> 10 loops, best of 3: 66.8 ms per loop
>>
>> In [42]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2.1)
>>
>> In [43]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 352 ms per loop
>>
>> In [44]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.9)
>>
>> In [45]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 353 ms per loop
>>
>> In [46]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1)
>>
>> In [47]: %timeit d.pairwise(x, y)
>> 10 loops, best of 3: 60.4 ms per loop
>>
>> In [48]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.1)
>>
>> In [49]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 357 ms per loop
>>
>> In [50]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(10)
>>
>> In [51]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 386 ms per loop
>>
>> In [52]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(100)
>>
>> In [53]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 403 ms per loop
>>
>> In [54]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1000)
>>
>> In [55]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 472 ms per loop
>>
>> In [56]:
>>
>>
>>
>>
>>
>> On Thu, May 14, 2015 at 3:14 PM, Jacob Vanderplas <
>> jake...@cs.washington.edu> wrote:
>>
>> Hi Nafise,
>> In general, yes, the execution time will depend on the metric, and in a
>> way that can be very difficult to predict.
>>
>> One reason is that, because the euclidean metric is by far the most
>> common, that case is slightly optimized. But that alone won't lead to a
>> factor of ten.
>>
>> The Ball Tree works by dividing the metric space into spherical nodes,
>> and depending on which metric you use, there can be more or less overlap
>> between the resulting nodes. As you increase the p value of a Minkowski
>> metric, there will generally be more overlap between nodes, because in
>> some sense the amount of space covered by a ball of radius R increases.
>> Any node overlap means that you're less likely to be able to throw away
>> branches when querying the tree, so more distance computations need to be
>> done. All of this is highly dependent on the precise dimension and
>> structure of the data you're querying: e.g. if you have uniformly
>> distributed data in high dimensions, there will be a larger slowdown than
>> if you have highly structured, low-dimensional data.
>>
>> Without knowing more details about the data you're looking at, I think
>> that's probably what's going on. Note that if you're interested in fast
>> queries with a Minkowski metric in particular, I'd try the KD Tree: the
>> node overlap issue shouldn't come up there.
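>>
>> (A rough benchmarking sketch, not from the original message, for checking
>> this on data shaped like yours; sizes and radius are placeholders.)
>>
>> import time
>> import numpy as np
>> from sklearn.neighbors import BallTree, KDTree
>>
>> X = np.random.random((40000, 20))
>> q = np.random.random((1, 20))
>>
>> # Compare the two trees as the Minkowski exponent grows.
>> for p in (1, 2, 10):
>>     for Tree in (BallTree, KDTree):
>>         tree = Tree(X, metric='minkowski', p=p)
>>         t0 = time.time()
>>         for _ in xrange(100):
>>             tree.query_radius(q, r=0.2)
>>         print Tree.__name__, 'p =', p, ':', time.time() - t0, 's'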
>>    Jake
>>
>>  Jake VanderPlas
>>  Director of Research – Physical Sciences
>>  eScience Institute, University of Washington
>>  http://www.vanderplas.com
>>
>> On Thu, May 14, 2015 at 12:48 AM, nafise mehdipoor <
>> mehdipour...@yahoo.com> wrote:
>>
>> Hi everyone,
>> Does the complexity of the BallTree implementation depend on the metric?
>> When I run my code to cluster my points and use the 'minkowski' metric
>> instead of 'euclidean', the time jumps from 0.8 s to 8 s!
>> Thank you.