Re: [Scikit-learn-general] Performance of LSHForest

Joel Nothman Thu, 16 Apr 2015 14:50:38 -0700

I more or less agree. Certainly we only need to do one searchsorted per
query per tree, and then do linear scans. There is a question of how close
we stay to the original LSHForest algorithm, which relies on matching
prefixes rather than hamming distance. Hamming distance is easier to
calculate in NumPy and is probably faster to calculate in C too (with or
without using POPCNT). Perhaps the only advantage of using Cython in your
solution is to avoid the memory overhead of unpackbits.


However, n_candidates before and after is arguably not sufficient if one
side has more than n_candidates with a high prefix overlap. But until we
look at the suffixes we can't know if it is closer or farther in hamming
distance.

I also think the use of n_candidates in the current code is somewhat
broken, as suggested by my XXX comment in _get_candidates, which we
discussed but did not resolve clearly. I think it will be hard to make
improvements of this sort without breaking the current results and
parameter sensitivities of the implementation.

On 17 April 2015 at 00:16, Daniel Vainsencher <[email protected]>
wrote:

> Hi Joel,
>
> To extend your analysis:
> - when n_samples*n_indices is large enough, the bottleneck is the use of
> the index, as you say.
> - when n_dimensions*n_candidates is large enough, the bottleneck is
> computation of true distances between DB points and the query.
>
> To serve well both kinds of use cases is perfectly possible, but
> requires use of the index that is both:
> A) Fast
> B) Uses the index optimally to reduce the number of candidates for which
> we compare distances.
>
> Here is a variant of your proposal (better keep track of context) that
> also requires a little Cython but improves both aspects A and B and
> reduces code complexity.
>
> Observation I:
> Only a single binary search per index is necessary, the first. After we
> find the correct location for the query binary code, we can restrict
> ourselves to the n_candidates (or even fewer) before and after that
> location.
>
> So no further binary searches are necessary at all, and the restriction
> to a small linear part of the array should be much more cache friendly.
> This makes full use of our array implementation of orderedcollection,
> instead of acting as if we were still on a binary tree implementation as
> in the original LSH-Forest paper.
>
> There is a price to pay for this simplification: we are now looking at
> (computing full distance from query for) 2*n_candidates*n_indices
> points, which can be expensive (we improved A at a cost to B).
>
> But here is where some Cython can be really useful. Observation II:
> The best information we can extract from the binary representation is
> not the distances in the tree structure, but hamming distances to the
> query.
>
> So after the restriction of I, compute the *hamming distances* of the
> 2*n_candidate*n_indices points each from the binary representation of
> the query (corresponding to the appropriate index). Then compute full
> metric only for the n_candidates with the lowest hamming distances.
>
> This should achieve a pretty good sweet spot of performance, with just a
> bit of Cython.
>
> Daniel
>
> On 04/16/2015 12:18 AM, Joel Nothman wrote:
> > Once we're dealing with large enough index and n_candidates, most time
> > is spent in searchsorted in the "synchronous ascending phase", while any
> > overhead around it is marginal. Currently we are searching over the
> > whole array in each searchsorted, while it could be rewritten to keep
> > better track of context to cut down the overall array when searching.
> > While possible, I suspect this will look confusing in Python/Numpy, and
> > Cython will be a clearer and faster way to present this logic.
> >
> > On the other hand, time spent in _compute_distances is substantial, and
> > yet most of its consumption is /outside/ of pairwise_distances. This
> > commit
> > <
> https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a
> >
> > cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
> > <https://github.com/scikit-learn/scikit-learn/pull/4603>!
> >
> > On 16 April 2015 at 12:32, Maheshakya Wijewardena
> > <[email protected] <mailto:[email protected]>> wrote:
> >
> >     Moreover, this drawback occurs because LSHForest does not vectorize
> >     multiple queries as in 'ball_tree' or any other method. This slows
> >     the exact neighbor distance calculation down significantly after
> >     approximation. This will not be a problem if queries are for
> >     individual points. Unfortunately, former is the more useful usage of
> >     LSHForest.
> >     Are you trying individual queries or multiple queries (for
> n_samples)?
> >
> >     On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher
> >     <[email protected] <mailto:[email protected]>>
> >     wrote:
> >
> >         LHSForest is not intended for dimensions at which exact methods
> >         work well, nor for tiny datasets. Try d>500, n_points>100000, I
> >         don't remember the switchover point.
> >
> >         The documentation should make this clear, but unfortunately I
> >         don't see that it does.
> >
> >         On Apr 15, 2015 7:08 PM, "Joel Nothman" <[email protected]
> >         <mailto:[email protected]>> wrote:
> >
> >             I agree this is disappointing, and we need to work on making
> >             LSHForest faster. Portions should probably be coded in
> >             Cython, for instance, as the current implementation is a bit
> >             circuitous in order to work in numpy. PRs are welcome.
> >
> >             LSHForest could use parallelism to be faster, but so can
> >             (and will) the exact neighbors methods. In theory in
> >             LSHForest, each "tree" could be stored on entirely different
> >             machines, providing memory benefits, but scikit-learn can't
> >             really take advantage of this.
> >
> >             Having said that, I would also try with higher n_features
> >             and n_queries. We have to limit the scale of our examples in
> >             order to limit the overall document compilation time.
> >
> >             On 16 April 2015 at 01:12, Miroslav Batchkarov
> >             <[email protected] <mailto:[email protected]>>
> wrote:
> >
> >                 Hi everyone,
> >
> >                 was really impressed by the speedups provided by
> >                 LSHForest compared to brute-force search. Out of
> >                 curiosity, I compared LSRForest to the existing ball
> >                 tree implementation. The approximate algorithm is
> >                 consistently slower (see below). Is this normal and
> >                 should it be mentioned in the documentation? Does
> >                 approximate search offer any benefits in terms of memory
> >                 usage?
> >
> >
> >                 I ran the same example
> >                 <
> http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py>
> with
> >                 a algorithm=ball_tree. I also had to set
> >                 metric=‘euclidean’ (this may affect results). The output
> is:
> >
> >                 Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup:
> >                 0.0, accuracy: 1.00 +/-0.00
> >                 Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup:
> >                 0.1, accuracy: 0.94 +/-0.05
> >                 Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup:
> >                 0.2, accuracy: 0.92 +/-0.07
> >                 Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup:
> >                 0.3, accuracy: 0.92 +/-0.07
> >                 Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup:
> >                 0.5, accuracy: 0.84 +/-0.10
> >                 Index size: 100000, exact: 0.008s, LSHF: 0.016s,
> >                 speedup: 0.5, accuracy: 0.80 +/-0.06
> >
> >                 With n_candidates=100, the output is
> >
> >                 Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup:
> >                 0.0, accuracy: 1.00 +/-0.00
> >                 Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup:
> >                 0.1, accuracy: 0.94 +/-0.05
> >                 Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup:
> >                 0.2, accuracy: 0.92 +/-0.07
> >                 Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup:
> >                 0.4, accuracy: 0.90 +/-0.11
> >                 Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup:
> >                 0.7, accuracy: 0.82 +/-0.13
> >                 Index size: 100000, exact: 0.007s, LSHF: 0.013s,
> >                 speedup: 0.6, accuracy: 0.78 +/-0.04
> >
> >
> >
> >                 ---
> >                 Miroslav Batchkarov
> >                 PhD Student,
> >                 Text Analysis Group,
> >                 Department of Informatics,
> >                 University of Sussex
> >
> >
> >
> >
> >
>  
> ------------------------------------------------------------------------------
> >                 BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM
> EDT
> >                 Develop your own process in accordance with the BPMN 2
> >                 standard
> >                 Learn Process modeling best practices with Bonita BPM
> >                 through live exercises
> >
> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
> >                 event?utm_
> >
>  source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
> >                 _______________________________________________
> >                 Scikit-learn-general mailing list
> >                 [email protected]
> >                 <mailto:[email protected]>
> >
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
> >
> >
> >
>  
> ------------------------------------------------------------------------------
> >             BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
> >             Develop your own process in accordance with the BPMN 2
> standard
> >             Learn Process modeling best practices with Bonita BPM
> >             through live exercises
> >
> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
> >             event?utm_
> >
>  source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
> >             _______________________________________________
> >             Scikit-learn-general mailing list
> >             [email protected]
> >             <mailto:[email protected]>
> >
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
> >
> >
>  
> ------------------------------------------------------------------------------
> >         BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
> >         Develop your own process in accordance with the BPMN 2 standard
> >         Learn Process modeling best practices with Bonita BPM through
> >         live exercises
> >         http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
> >         event?utm_
> >
>  source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
> >         _______________________________________________
> >         Scikit-learn-general mailing list
> >         [email protected]
> >         <mailto:[email protected]>
> >
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
> >
> >
> >
> >     --
> >     /Maheshakya Wijewardena,
> >     Undergraduate,/
> >     /Department of Computer Science and Engineering,/
> >     /Faculty of Engineering./
> >     /University of Moratuwa,/
> >     /Sri Lanka/
> >
> >
>  
> ------------------------------------------------------------------------------
> >     BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
> >     Develop your own process in accordance with the BPMN 2 standard
> >     Learn Process modeling best practices with Bonita BPM through live
> >     exercises
> >     http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
> >     event?utm_
> >
>  source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
> >     _______________________________________________
> >     Scikit-learn-general mailing list
> >     [email protected]
> >     <mailto:[email protected]>
> >     https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
> >
> >
> >
> >
> ------------------------------------------------------------------------------
> > BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
> > Develop your own process in accordance with the BPMN 2 standard
> > Learn Process modeling best practices with Bonita BPM through live
> exercises
> > http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
> event?utm_
> > source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
> >
> >
> >
> > _______________________________________________
> > Scikit-learn-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
>
>
>
> ------------------------------------------------------------------------------
> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
> Develop your own process in accordance with the BPMN 2 standard
> Learn Process modeling best practices with Bonita BPM through live
> exercises
> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
> event?utm_
> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>

------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Performance of LSHForest

Reply via email to