Oops, I overlooked that the comparison was against EuclideanDistance proper.
But as Jake says, some optimization does happen in the l2 case, and in the
example I gave it accounts for a factor of 5-6.
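
For concreteness, the l2 shortcut I have in mind is the dot-product
expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, which replaces the
per-pair loop with a single matrix product; sklearn.metrics.pairwise.
euclidean_distances uses this identity, though whether the BallTree /
DistanceMetric code path does the same is only my assumption. A minimal
NumPy sketch:

import numpy as np

def pairwise_sq_euclidean(X, Y):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y : the expensive part is a
    # single BLAS matrix product rather than an explicit loop over pairs.
    XX = (X * X).sum(axis=1)[:, np.newaxis]   # shape (n, 1)
    YY = (Y * Y).sum(axis=1)[np.newaxis, :]   # shape (1, m)
    D2 = XX + YY - 2.0 * np.dot(X, Y.T)
    np.maximum(D2, 0.0, out=D2)               # clip round-off negatives
    return D2

x = np.random.randn(100, 200)
y = np.random.randn(200, 200)
dist = np.sqrt(pairwise_sq_euclidean(x, y))   # equals the p=2 Minkowski distance
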
On Thu, May 14, 2015 at 3:44 PM, Michael Eickenberg <
michael.eickenb...@gmail.com> wrote:
> There may also be some implicit optimization going on. Without looking at
> the (non-pure-Python) source of the Minkowski metric, a few tests hint
> that p=2 and p=1 are probably treated differently from the rest, because
> some mathematical tricks can be applied that avoid calculating the full
> distance vector. So maybe some of the speed difference is already
> explained by that, but I am not sure which effect dominates here, since I
> am unfamiliar with the internals of BallTree and how many of these
> evaluations it actually does.
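>
> For reference, the generic path (arbitrary p) written out in plain NumPy
> looks roughly like the sketch below; the elementwise abs()**p and the
> final 1/p root over the full difference vector are exactly the pieces a
> dedicated p=1 or p=2 branch could skip or replace with cheaper
> operations. That is only my reading of where the time would go, not a
> description of the actual Cython source:
>
> import numpy as np
>
> def minkowski_pairwise(X, Y, p):
>     # Full |x - y| vector for every pair, raised to the p-th power,
>     # summed over features, then the 1/p root.
>     diff = np.abs(X[:, np.newaxis, :] - Y[np.newaxis, :, :])
>     return (diff ** p).sum(axis=-1) ** (1.0 / p)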
>
> Michael
>
>
> In [38]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2)
>
> In [39]: x = np.random.randn(100, 200)
>
> In [40]: y = np.random.randn(200, 200)
>
> In [41]: %timeit d.pairwise(x, y)
> 10 loops, best of 3: 66.8 ms per loop
>
> In [42]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2.1)
>
> In [43]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 352 ms per loop
>
> In [44]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.9)
>
> In [45]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 353 ms per loop
>
> In [46]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1)
>
> In [47]: %timeit d.pairwise(x, y)
> 10 loops, best of 3: 60.4 ms per loop
>
> In [48]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.1)
>
> In [49]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 357 ms per loop
>
> In [50]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(10)
>
> In [51]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 386 ms per loop
>
> In [52]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(100)
>
> In [53]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 403 ms per loop
>
> In [54]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1000)
>
> In [55]: %timeit d.pairwise(x, y)
> 1 loops, best of 3: 472 ms per loop
>
> On Thu, May 14, 2015 at 3:14 PM, Jacob Vanderplas <
> jake...@cs.washington.edu> wrote:
>
>> Hi Nafise,
>> In general, yes, the execution time will depend on the metric, and in a
>> way that can be very difficult to predict.
>>
>> One reason is that, because the euclidean metric is the most common, that
>> case is slightly optimized. But that alone won't account for a factor of
>> ten.
>>
>> The Ball Tree works by dividing the metric space into spherical nodes,
>> and depending on which metric you use, there can be more or less overlap
>> between the resulting nodes. As you increase the p value of a Minkowski
>> metric, there will generally be more overlap between nodes, because in
>> some sense the amount of space covered by a ball of radius R increases.
>> Any node overlap means that you're less likely to throw away branches
>> when querying the tree, meaning that more distance computations need to
>> be done. All of this is highly dependent on the precise dimension and
>> structure of the data you're querying: e.g. if you have
>> uniformly-distributed data in high dimensions, there will be a larger
>> slowdown than if you have highly-structured, low-dimensional data.
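>>
>> If you want to see this effect on your own data, a rough check along
>> these lines should do it (the array shape and p values below are just
>> placeholders; the actual timings will depend heavily on your data):
>>
>> import numpy as np
>> from time import time
>> from sklearn.neighbors import BallTree
>>
>> X = np.random.randn(5000, 20)          # substitute your own data here
>> for p in (1, 2, 4, 10):
>>     tree = BallTree(X, metric='minkowski', p=p)
>>     t0 = time()
>>     tree.query(X, k=5)                 # query cost grows with node overlap
>>     print(p, time() - t0)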
>>
>> Without knowing more details about the data you're looking at, I think
>> that's probably what's going on. Note that if you're interested in fast
>> queries with a Minkowski metric in particular, I'd try the KD Tree: the
>> node overlap issue shouldn't come up there.
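>>
>> The swap is a one-liner, since KDTree and BallTree share the same query
>> interface (KDTree supports the minkowski metric, though not arbitrary
>> user-supplied metrics); the data and p below are placeholders:
>>
>> import numpy as np
>> from sklearn.neighbors import KDTree
>>
>> X = np.random.randn(5000, 20)
>> tree = KDTree(X, metric='minkowski', p=4)
>> dist, ind = tree.query(X, k=5)
>>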
>> Jake
>>
>> Jake VanderPlas
>> Director of Research – Physical Sciences
>> eScience Institute, University of Washington
>> http://www.vanderplas.com
>>
>> On Thu, May 14, 2015 at 12:48 AM, nafise mehdipoor <
>> mehdipour...@yahoo.com> wrote:
>>
>>> Hi everyone,
>>> Does the complexity of the BallTree implementation depend on the metric?
>>> When I run my code to cluster my points and use the 'minkowski' metric
>>> instead of 'euclidean', the time jumps from 0.8 s to 8 s!
>>> Thank you.