Hi Joel, I was on vacation for the past 3 days. I'll look into this ASAP and let you all know.
I also did some profiling, but only of the `pairwise_distances` method. The brute-force technique uses it directly on the entire query array, but LSHForest uses it in a loop, and I noticed a huge lag there. I'll first confirm your claims. I can start working on this, but I think I'll need reviewing from you or some other contributors as well. I'll do this if it's possible.

On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman <joel.noth...@gmail.com> wrote:

> @Maheshakya, will you be able to do work in the near future on speeding up
> the ascending phase instead? Or should another contributor take that baton?
> Not only does it seem to be a major contributor to runtime, but it is
> independent of metric and hashing mechanism (within binary hashes), and
> hence the most fundamental component of LSHForest.
>
> On 30 July 2015 at 22:28, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> (Sorry, I should have said the first b layers, not 2**b layers, producing
>> a memoization of 2**b offsets.)
>>
>> On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com> wrote:
>>
>>> One approach to fixing the ascending phase would be to ensure that
>>> _find_matching_indices only searches over parts of the tree that have
>>> not yet been explored; currently it searches over the entire index at
>>> each depth.
>>>
>>> My preferred, but more experimental, solution is to memoize where the
>>> first 2**b layers of the tree begin and end in the index, for small b.
>>> So if our index stored:
>>>
>>>     [[0, 0, 0, 1, 1, 0, 0, 0],
>>>      [0, 0, 1, 0, 1, 0, 1, 0],
>>>      [0, 0, 1, 0, 1, 0, 1, 0],
>>>      [0, 1, 0, 0, 0, 0, 0, 0],
>>>      [0, 1, 0, 1, 1, 0, 0, 0],
>>>      [0, 1, 1, 0, 0, 1, 1, 0],
>>>      [1, 0, 0, 0, 0, 1, 0, 1],
>>>      [1, 0, 0, 1, 0, 1, 0, 1],
>>>      [1, 1, 0, 0, 0, 0, 0, 0],
>>>      [1, 1, 1, 1, 1, 0, 0, 0]]
>>>
>>> and b=2, we'd memoize offsets for prefixes of size 2:
>>>
>>>     [0,  # 00
>>>      3,  # 01
>>>      6,  # 10
>>>      8]  # 11
>>>
>>> Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
>>> leave the first b bits [0, 1] remaining, and look them up in the array
>>> just defined to identify a much narrower search space [3, 6) matching
>>> that prefix in the overall index.
>>>
>>> Indeed, given the min_hash_match constraint, not having this sort of
>>> thing for b >= min_hash_match seems wasteful.
>>>
>>> This gives us O(1) access to the top layers of the tree when ascending,
>>> and makes the searchsorted calls run in log(n / (2 ** b)) time rather
>>> than log(n). It is also much more like traditional LSH. However, it
>>> complicates the code, as we now have to consider two strategies for
>>> descent/ascent.
>>>
>>> On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>
>>>> What makes you think this is the main bottleneck? While it is not an
>>>> insignificant consumer of time, I really doubt this is what's making
>>>> scikit-learn's LSH implementation severely underperform with respect
>>>> to other implementations.
>>>>
>>>> We need to profile. In order to do that, we need some sensible
>>>> parameters that users might actually want, e.g. number of features for
>>>> {dense, sparse} cases, index size, and target 10NN precision and
>>>> recall (selecting corresponding n_estimators and n_candidates).
>>>> Ideally we'd consider real-world datasets. And of course, these should
>>>> be sensible for whichever metric we're operating over, and for whether
>>>> we're doing kNN or radius searches.
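For readers following along, the prefix-offset idea above can be sketched in a few lines of numpy. This is only an illustration of the technique, not scikit-learn's internal API: the names `index`, `offsets`, `b`, and the integer packing of the binary hashes are all assumptions made here for the demo.

```python
import numpy as np

# The 10 sorted binary hashes from the example above, packed as uint8
# integers so that extracting a length-b prefix is a right shift.
index = np.array([0b00011000, 0b00101010, 0b00101010, 0b01000000,
                  0b01011000, 0b01100110, 0b10000101, 0b10010101,
                  0b11000000, 0b11111000], dtype=np.uint8)
n_bits, b = 8, 2

# Memoize, for every length-b prefix p, where its block starts in the
# sorted index: offsets[p] is the first row whose top b bits are >= p.
offsets = np.searchsorted(index >> (n_bits - b), np.arange(2 ** b))
print(offsets)  # [0 3 6 8]

# Query [0, 1, 1, 0, 0, 0, 0, 0] -> its top b bits are 0b01.
query = 0b01100000
p = query >> (n_bits - b)
lo = offsets[p]
hi = offsets[p + 1] if p + 1 < 2 ** b else len(index)
print(lo, hi)  # later searchsorted calls only scan rows [3, 6)
```

Any subsequent searchsorted then operates on `index[lo:hi]` rather than the full index, which is where the log(n / (2 ** b)) figure comes from.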
>>>>
>>>> I don't know if it's realistic, but I've profiled the following
>>>> bench_plot_approximate_neighbors settings:
>>>>
>>>>     Building NearestNeighbors for 100000 samples in 100 dimensions
>>>>     LSHF parameters: n_estimators = 15, n_candidates = 100
>>>>     Building LSHForest for 100000 samples in 100 dimensions
>>>>     Done in 1.492s
>>>>     Average time for lshf neighbor queries: 0.005s
>>>>     Average time for exact neighbor queries: 0.002s
>>>>     Average Accuracy : 0.88
>>>>     Speed up: 0.5x
>>>>
>>>> Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
>>>> matrix, we have:
>>>>
>>>>    - 0.03s spent in _query (hashing and descending)
>>>>    - 0.91s spent in _compute_distances (exact distance calculation)
>>>>    - 3.80s remaining in _get_candidates (the ascending phase), almost
>>>>      all of which is spent in _find_matching_indices
>>>>
>>>> Cutting exact distance calculation to 0s will still not get this
>>>> faster than the exact approach. Of course, your mileage may vary, but
>>>> this suggests to me you're barking up the wrong tree (no pun intended).
>>>>
>>>> On 30 July 2015 at 19:43, Maheshakya Wijewardena <
>>>> pmaheshak...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've started to look into the matter of improving the performance of
>>>>> LSHForest. As we discussed some time ago (in fact, quite a long time),
>>>>> the main concern is to Cythonize the distance calculations. Currently,
>>>>> this is done by iterating over all the query vectors in a Python loop
>>>>> when the `kneighbors` method is called for a set of query vectors,
>>>>> and that per-query iteration has been identified as a huge overhead.
>>>>> I have implemented a few Cython hacks to demonstrate the distance
>>>>> calculation in LSHForest, and I was able to get an approximate 10x
>>>>> speedup compared to the current distance calculation with a Python
>>>>> loop. However, I came across some blockers while trying to do this
>>>>> and need some clarifications.
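The per-query-loop overhead described above is easy to reproduce with plain numpy. The sketch below stands in for the real code paths (the array names and sizes are made up for the demo): both variants compute the same euclidean distance matrix, but the first pays Python/numpy dispatch overhead once per query while the second reduces everything to a single matrix product, as the brute-force path does.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 100)   # indexed points (hypothetical sizes)
Q = rng.rand(200, 100)    # query points

# Per-query Python loop, as in the current kneighbors path: each
# iteration is a small computation wrapped in full call overhead.
d_loop = np.vstack([np.sqrt(((X - q) ** 2).sum(axis=1)) for q in Q])

# One vectorized call over the whole query matrix:
# ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x, i.e. a single matmul.
sq_norms = (Q ** 2).sum(axis=1)[:, None] + (X ** 2).sum(axis=1)[None, :]
d_vec = np.sqrt(np.maximum(sq_norms - 2 * (Q @ X.T), 0))

assert d_vec.shape == (200, 1000)
assert np.allclose(d_loop, d_vec)
```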
>>>>>
>>>>> What I need to know is: do we have a mechanism to release the GIL
>>>>> when we want to parallelize? One of my observations is that
>>>>> `pairwise_distances` uses all the cores even when I don't specify
>>>>> the `n_jobs` parameter, which defaults to 1. Is this expected
>>>>> behavior?
>>>>>
>>>>> If I want to release the GIL, can I use the OpenMP module in Cython?
>>>>> Or is that a task for joblib?
>>>>>
>>>>> Any input on this is highly appreciated.
>>>>>
>>>>> Best regards,
>>>>> --
>>>>> Maheshakya Wijewardena,
>>>>> Undergraduate,
>>>>> Department of Computer Science and Engineering,
>>>>> Faculty of Engineering,
>>>>> University of Moratuwa,
>>>>> Sri Lanka

--
Maheshakya Wijewardena,
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering,
University of Moratuwa,
Sri Lanka
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general