Using EuclideanDistance directly: this can actually account for a factor of
about 10 (probably dimensionality-dependent):
In [56]: d = sklearn.neighbors.dist_metrics.EuclideanDistance()
In [57]: %timeit d.pairwise(x, y)
100 loops, best of 3: 6.04 ms per loop
In [58]: %timeit d.pairwise(x, y)
100 loops, best of 3: 5.93 ms per loop
In [59]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2)
In [60]: %timeit d.pairwise(x, y)
10 loops, best of 3: 66.9 ms per loop
In [61]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(3)
In [62]: %timeit d.pairwise(x, y)
1 loops, best of 3: 377 ms per loop
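
For reference, here is a rough way to reproduce the comparison through the
public DistanceMetric factory instead of instantiating the classes in
dist_metrics directly. This is only a sketch: x and y are just the random
arrays from the session quoted below, and timings will vary with shape and
hardware.

import numpy as np
from sklearn.neighbors import DistanceMetric

x = np.random.randn(100, 200)
y = np.random.randn(200, 200)

d_euc = DistanceMetric.get_metric('euclidean')
d_p2 = DistanceMetric.get_metric('minkowski', p=2)
d_p3 = DistanceMetric.get_metric('minkowski', p=3)

# in IPython, e.g.:
# %timeit d_euc.pairwise(x, y)   # fast: the euclidean case is specialized
# %timeit d_p2.pairwise(x, y)    # slower: presumably the general minkowski path
# %timeit d_p3.pairwise(x, y)    # slower still
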
On Thu, May 14, 2015 at 4:00 PM, Michael Eickenberg <
michael.eickenb...@gmail.com> wrote:
> Oops, I overlooked that the comparison was against EuclideanDistance proper.
> But as Jake says, some optimization does happen in the l2 case, and in the
> example I gave it accounts for a factor of 5-6.
>
> On Thu, May 14, 2015 at 3:44 PM, Michael Eickenberg <
> michael.eickenb...@gmail.com> wrote:
>
>> There may also be some implicit optimization going on. Without looking at
>> the (non pure-Python) source of the Minkowski metric, a few tests suggest
>> that p=2 and p=1 are probably treated differently from the rest, because
>> mathematical shortcuts can be applied that avoid computing the distance in
>> the fully generic way. So maybe some of the speed difference is already
>> explained by that, but I am not sure which effect dominates here, since I
>> am unfamiliar with the internals of BallTree and how many of these distance
>> evaluations it actually performs.
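>>
>> For illustration, here is a rough sketch (not the actual dist_metrics code)
>> of the kind of shortcut that is possible for p=2: the squared distance can
>> be expanded into dot products, so a single matrix product replaces the
>> elementwise power-and-sum of the generic minkowski path.
>>
>> import numpy as np
>>
>> x = np.random.randn(100, 200)
>> y = np.random.randn(200, 200)
>>
>> # generic route: |x_i - y_j|**p summed over features, then the p-th root
>> d_generic = (np.abs(x[:, None, :] - y[None, :, :]) ** 2).sum(-1) ** 0.5
>>
>> # l2 shortcut: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
>> sq = (x ** 2).sum(1)[:, None] + (y ** 2).sum(1)[None, :] - 2 * np.dot(x, y.T)
>> d_l2 = np.sqrt(np.maximum(sq, 0))
>>
>> # the two should agree up to floating point error
>> assert np.allclose(d_generic, d_l2)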
>>
>> Michael
>>
>>
>> In [38]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2)
>>
>> In [39]: x = np.random.randn(100, 200)
>>
>> In [40]: y = np.random.randn(200, 200)
>>
>> In [41]: %timeit d.pairwise(x, y)
>> 10 loops, best of 3: 66.8 ms per loop
>>
>> In [42]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(2.1)
>>
>> In [43]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 352 ms per loop
>>
>> In [44]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.9)
>>
>> In [45]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 353 ms per loop
>>
>> In [46]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1)
>>
>> In [47]: %timeit d.pairwise(x, y)
>> 10 loops, best of 3: 60.4 ms per loop
>>
>> In [48]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1.1)
>>
>> In [49]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 357 ms per loop
>>
>> In [50]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(10)
>>
>> In [51]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 386 ms per loop
>>
>> In [52]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(100)
>>
>> In [53]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 403 ms per loop
>>
>> In [54]: d = sklearn.neighbors.dist_metrics.MinkowskiDistance(1000)
>>
>> In [55]: %timeit d.pairwise(x, y)
>> 1 loops, best of 3: 472 ms per loop
>>
>>
>> On Thu, May 14, 2015 at 3:14 PM, Jacob Vanderplas <
>> jake...@cs.washington.edu> wrote:
>>
>>> Hi Nafise,
>>> In general, yes the execution time will depend on the metric, and in a
>>> way that can be very difficult to predict.
>>>
>>> One reason is that, because the euclidean metric is the most common, that
>>> case is slightly optimized. But that alone won't lead to a factor of ten.
>>>
>>> The Ball Tree works by dividing the metric space into spherical nodes,
>>> and depending on which metric you use, there can be more or less overlap
>>> between the resulting nodes. As you increase the p value of a minkowski
>>> metric, there will generally be more overlap between nodes, because the
>>> amount of space covered by a ball of radius R increases (as p grows, the
>>> unit ball approaches the enclosing hypercube). Node overlap means that
>>> you're less likely to throw away branches when querying the tree, so more
>>> distance computations need to be done. All of this is highly dependent on
>>> the precise dimension and structure of the data you're querying: e.g. if
>>> you have uniformly-distributed data in high dimensions, there will be a
>>> larger slowdown than if you have highly-structured low-dimensional data.
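>>>
>>> As a rough illustration (the data shape and p values below are arbitrary,
>>> and the size of the effect will depend heavily on your data), you can see
>>> the p-dependence directly by timing BallTree queries:
>>>
>>> import numpy as np
>>> from sklearn.neighbors import BallTree
>>>
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(2000, 20)   # fairly unstructured, moderately high-dimensional
>>>
>>> for p in (1, 2, 3, 10):
>>>     tree = BallTree(X, metric='minkowski', p=p)
>>>     # time this query, e.g. with %timeit; larger p generally means more
>>>     # node overlap and therefore more distance evaluations
>>>     dist, ind = tree.query(X[:200], k=5)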
>>>
>>> Without knowing more details about the data you're looking at, I think
>>> that's probably what's going on. Note that if you're interested in fast
>>> queries for a minkowski metric in particular, I'd try the KD Tree: the node
>>> overlap issue shouldn't come up here.
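>>>
>>> A minimal sketch of the KD Tree route (same illustrative data as above;
>>> KDTree also accepts the minkowski metric with a p parameter):
>>>
>>> import numpy as np
>>> from sklearn.neighbors import KDTree
>>>
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(2000, 20)
>>>
>>> tree = KDTree(X, metric='minkowski', p=3)
>>> dist, ind = tree.query(X[:200], k=5)   # axis-aligned nodes, no ball overlap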
>>> Jake
>>>
>>> Jake VanderPlas
>>> Director of Research – Physical Sciences
>>> eScience Institute, University of Washington
>>> http://www.vanderplas.com
>>>
>>> On Thu, May 14, 2015 at 12:48 AM, nafise mehdipoor <
>>> mehdipour...@yahoo.com> wrote:
>>>
>>>> Hi everyone,
>>>> Does the complexity of the BallTree implementation depend on the metric?
>>>> When I run my code to cluster my points and use the 'minkowski' metric
>>>> instead of 'euclidean', the time jumps from 0.8 s to 8 s!
>>>> Thank you.
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general