ball tree is not vectorized in the sense of SIMD, but there is Python/numpy
overhead in LSHForest that is not present in ball tree.
I think one of the problems is the high n_candidates relative to the
n_neighbors. This really increases the search time.
Once we're dealing with large enough index and n_candidates, most time is
spent in searchsorted in the "synchronous ascending phase", while any
overhead around it is marginal. Currently we are searching over the whole
array in each searchsorted, while it could be rewritten to keep better
track of context to cut down the overall array when searching. While
possible, I suspect this will look confusing in Python/Numpy, and Cython
will be a clearer and faster way to present this logic.
On the other hand, time spent in _compute_distances is substantial, and yet
most of its consumption is *outside* of pairwise_distances. This commit
<https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a>
cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
<https://github.com/scikit-learn/scikit-learn/pull/4603>!
On 16 April 2015 at 12:32, Maheshakya Wijewardena <pmaheshak...@gmail.com>
wrote:
> Moreover, this drawback occurs because LSHForest does not vectorize
> multiple queries as in 'ball_tree' or any other method. This slows the
> exact neighbor distance calculation down significantly after approximation.
> This will not be a problem if queries are for individual points.
> Unfortunately, former is the more useful usage of LSHForest.
> Are you trying individual queries or multiple queries (for n_samples)?
>
> On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher <
> daniel.vainsenc...@gmail.com> wrote:
>
>> LHSForest is not intended for dimensions at which exact methods work
>> well, nor for tiny datasets. Try d>500, n_points>100000, I don't remember
>> the switchover point.
>>
>> The documentation should make this clear, but unfortunately I don't see
>> that it does.
>> On Apr 15, 2015 7:08 PM, "Joel Nothman" <joel.noth...@gmail.com> wrote:
>>
>>> I agree this is disappointing, and we need to work on making LSHForest
>>> faster. Portions should probably be coded in Cython, for instance, as the
>>> current implementation is a bit circuitous in order to work in numpy. PRs
>>> are welcome.
>>>
>>> LSHForest could use parallelism to be faster, but so can (and will) the
>>> exact neighbors methods. In theory in LSHForest, each "tree" could be
>>> stored on entirely different machines, providing memory benefits, but
>>> scikit-learn can't really take advantage of this.
>>>
>>> Having said that, I would also try with higher n_features and n_queries.
>>> We have to limit the scale of our examples in order to limit the overall
>>> document compilation time.
>>>
>>> On 16 April 2015 at 01:12, Miroslav Batchkarov <mbatchka...@gmail.com>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> was really impressed by the speedups provided by LSHForest compared to
>>>> brute-force search. Out of curiosity, I compared LSRForest to the existing
>>>> ball tree implementation. The approximate algorithm is consistently slower
>>>> (see below). Is this normal and should it be mentioned in the
>>>> documentation? Does approximate search offer any benefits in terms of
>>>> memory usage?
>>>>
>>>>
>>>> I ran the same example
>>>> <http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py>
>>>> with
>>>> a algorithm=ball_tree. I also had to set metric=‘euclidean’ (this may
>>>> affect results). The output is:
>>>>
>>>> Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy:
>>>> 1.00 +/-0.00
>>>> Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy:
>>>> 0.94 +/-0.05
>>>> Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy:
>>>> 0.92 +/-0.07
>>>> Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3, accuracy:
>>>> 0.92 +/-0.07
>>>> Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5, accuracy:
>>>> 0.84 +/-0.10
>>>> Index size: 100000, exact: 0.008s, LSHF: 0.016s, speedup: 0.5,
>>>> accuracy: 0.80 +/-0.06
>>>>
>>>> With n_candidates=100, the output is
>>>>
>>>> Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy:
>>>> 1.00 +/-0.00
>>>> Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy:
>>>> 0.94 +/-0.05
>>>> Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy:
>>>> 0.92 +/-0.07
>>>> Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4, accuracy:
>>>> 0.90 +/-0.11
>>>> Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7, accuracy:
>>>> 0.82 +/-0.13
>>>> Index size: 100000, exact: 0.007s, LSHF: 0.013s, speedup: 0.6,
>>>> accuracy: 0.78 +/-0.04
>>>>
>>>>
>>>>
>>>> ---
>>>> Miroslav Batchkarov
>>>> PhD Student,
>>>> Text Analysis Group,
>>>> Department of Informatics,
>>>> University of Sussex
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
>>>> Develop your own process in accordance with the BPMN 2 standard
>>>> Learn Process modeling best practices with Bonita BPM through live
>>>> exercises
>>>> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
>>>> event?utm_
>>>> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
>>> Develop your own process in accordance with the BPMN 2 standard
>>> Learn Process modeling best practices with Bonita BPM through live
>>> exercises
>>> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
>>> event?utm_
>>> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>> ------------------------------------------------------------------------------
>> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
>> Develop your own process in accordance with the BPMN 2 standard
>> Learn Process modeling best practices with Bonita BPM through live
>> exercises
>> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
>> event?utm_
>> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
> --
>
> *Maheshakya Wijewardena,Undergraduate,*
> *Department of Computer Science and Engineering,*
> *Faculty of Engineering.*
> *University of Moratuwa,*
> *Sri Lanka*
>
>
> ------------------------------------------------------------------------------
> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
> Develop your own process in accordance with the BPMN 2 standard
> Learn Process modeling best practices with Bonita BPM through live
> exercises
> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
> event?utm_
> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general