I agree this is disappointing, and we need to work on making LSHForest
faster. Portions should probably be coded in Cython, for instance, since the
current implementation takes a somewhat circuitous route in order to stay
within numpy. PRs are welcome.

LSHForest could use parallelism to be faster, but so can (and will) the
exact neighbors methods. In theory, each "tree" in LSHForest could be
stored on an entirely different machine, which would provide memory
benefits, but scikit-learn can't really take advantage of this.

Having said that, I would also try with higher n_features and n_queries. We
have to limit the scale of our examples in order to limit the overall
document compilation time.
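
For concreteness, the kind of comparison I mean would look roughly like the
sketch below (written against the current 0.16-era LSHForest API; the sizes
are only illustrative, and the accuracy measure is simply the overlap of the
approximate neighbors with the exact ones, not the exact code of the docs
example):

    import time
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import LSHForest, NearestNeighbors

    # Larger than the docs example; tune n_features / n_queries upwards
    # to see where the approximate search starts to pay off.
    n_samples, n_queries, n_features = 100000, 500, 128
    X, _ = make_blobs(n_samples=n_samples + n_queries,
                      n_features=n_features, random_state=0)
    X_index, X_query = X[:n_samples], X[n_samples:]

    # Exact search with a ball tree, euclidean metric (as in your run).
    exact = NearestNeighbors(n_neighbors=10, algorithm='ball_tree',
                             metric='euclidean').fit(X_index)
    t0 = time.time()
    exact_nn = exact.kneighbors(X_query, return_distance=False)
    t_exact = time.time() - t0

    # Approximate search with LSHForest.
    lshf = LSHForest(n_estimators=10, n_candidates=100,
                     random_state=0).fit(X_index)
    t0 = time.time()
    approx_nn = lshf.kneighbors(X_query, n_neighbors=10,
                                return_distance=False)
    t_lshf = time.time() - t0

    # Accuracy = mean fraction of exact neighbors recovered per query.
    accuracy = np.mean([np.in1d(a, e).mean()
                        for a, e in zip(approx_nn, exact_nn)])
    print("exact: %.3fs  LSHF: %.3fs  speedup: %.1f  accuracy: %.2f"
          % (t_exact, t_lshf, t_exact / t_lshf, accuracy))

At small index sizes and low dimensionality the LSH overhead dominates, which
is consistent with the numbers you report below.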

On 16 April 2015 at 01:12, Miroslav Batchkarov <mbatchka...@gmail.com>
wrote:

> Hi everyone,
>
> I was really impressed by the speedups provided by LSHForest compared to
> brute-force search. Out of curiosity, I compared LSHForest to the existing
> ball tree implementation. The approximate algorithm is consistently slower
> (see below). Is this normal, and should it be mentioned in the
> documentation? Does approximate search offer any benefits in terms of
> memory usage?
>
>
> I ran the same example
> <http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py>
> with algorithm='ball_tree'. I also had to set metric='euclidean' (this may
> affect results). The output is:
>
> Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy: 1.00 +/-0.00
> Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy: 0.94 +/-0.05
> Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy: 0.92 +/-0.07
> Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3, accuracy: 0.92 +/-0.07
> Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5, accuracy: 0.84 +/-0.10
> Index size: 100000, exact: 0.008s, LSHF: 0.016s, speedup: 0.5, accuracy: 0.80 +/-0.06
>
> With n_candidates=100, the output is:
>
> Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy: 1.00 +/-0.00
> Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy: 0.94 +/-0.05
> Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy: 0.92 +/-0.07
> Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4, accuracy: 0.90 +/-0.11
> Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7, accuracy: 0.82 +/-0.13
> Index size: 100000, exact: 0.007s, LSHF: 0.013s, speedup: 0.6, accuracy: 0.78 +/-0.04
>
>
>
> ---
> Miroslav Batchkarov
> PhD Student,
> Text Analysis Group,
> Department of Informatics,
> University of Sussex