(sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets)
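To make that concrete, here's a minimal NumPy sketch of the memoized offsets, using the example index and query from the message quoted below (the names index, weights and offsets are just illustrative, not the current LSHForest internals):

    import numpy as np

    # sorted binary hash index from the example below: one hash per row,
    # rows in lexicographic order
    index = np.array([[0, 0, 0, 1, 1, 0, 0, 0],
                      [0, 0, 1, 0, 1, 0, 1, 0],
                      [0, 0, 1, 0, 1, 0, 1, 0],
                      [0, 1, 0, 0, 0, 0, 0, 0],
                      [0, 1, 0, 1, 1, 0, 0, 0],
                      [0, 1, 1, 0, 0, 1, 1, 0],
                      [1, 0, 0, 0, 0, 1, 0, 1],
                      [1, 0, 0, 1, 0, 1, 0, 1],
                      [1, 1, 0, 0, 0, 0, 0, 0],
                      [1, 1, 1, 1, 1, 0, 0, 0]])

    b = 2
    weights = 2 ** np.arange(b)[::-1]    # [2, 1]: read the first b bits as an int

    # prefix value (0 .. 2**b - 1) of every row; since the index is sorted,
    # rows sharing a prefix are contiguous
    prefixes = index[:, :b].dot(weights)

    # offsets[p] = first row whose prefix is >= p, so rows with prefix p
    # occupy [offsets[p], offsets[p + 1])
    offsets = np.searchsorted(prefixes, np.arange(2 ** b + 1))
    print(offsets[:-1])                  # [0 3 6 8], as in the example

    query = np.array([0, 1, 1, 0, 0, 0, 0, 0])
    p = query[:b].dot(weights)           # prefix 01 -> 1
    lo, hi = offsets[p], offsets[p + 1]
    print(lo, hi)                        # 3 6: the narrowed range [3, 6)
    # deeper searchsorted calls during the ascent can then be restricted to
    # index[lo:hi], i.e. log(n / 2**b) work instead of log(n)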
On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com> wrote:

> One approach to fixing the ascending phase would ensure that
> _find_matching_indices is only searching over parts of the tree that
> have not yet been explored, while currently it searches over the entire
> index at each depth.
>
> My preferred, but more experimental, solution is to memoize where the
> first 2**b layers of the tree begin and end in the index, for small b.
> So if our index stored:
>
> [[0, 0, 0, 1, 1, 0, 0, 0],
>  [0, 0, 1, 0, 1, 0, 1, 0],
>  [0, 0, 1, 0, 1, 0, 1, 0],
>  [0, 1, 0, 0, 0, 0, 0, 0],
>  [0, 1, 0, 1, 1, 0, 0, 0],
>  [0, 1, 1, 0, 0, 1, 1, 0],
>  [1, 0, 0, 0, 0, 1, 0, 1],
>  [1, 0, 0, 1, 0, 1, 0, 1],
>  [1, 1, 0, 0, 0, 0, 0, 0],
>  [1, 1, 1, 1, 1, 0, 0, 0]]
>
> and b=2, we'd memoize offsets for prefixes of size 2:
>
> [0,  # 00
>  3,  # 01
>  6,  # 10
>  8]  # 11
>
> Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
> leave the first b bits [0, 1] remaining, and look them up in the array
> just defined to identify a much narrower search space [3, 6) matching
> that prefix in the overall index.
>
> Indeed, given the min_hash_match constraint, not having this sort of
> thing for b >= min_hash_match seems wasteful.
>
> This provides us O(1) access to the top layers of the tree when
> ascending, and makes the searchsorted calls run in log(n / (2 ** b))
> time rather than log(n). It is also much more like traditional LSH.
> However, it complexifies the code, as we now have to consider two
> strategies for descent/ascent.
>
> On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> What makes you think this is the main bottleneck? While it is not an
>> insignificant consumer of time, I really doubt this is what's making
>> scikit-learn's LSH implementation severely underperform with respect
>> to other implementations.
>>
>> We need to profile. In order to do that, we need some sensible
>> parameters that users might actually want, e.g. number of features for
>> {dense, sparse} cases, index size, target 10NN precision and recall
>> (selecting corresponding n_estimators and n_candidates). Ideally we'd
>> consider real-world datasets. And of course, these should be sensible
>> for whichever metric we're operating over, and whether we're doing KNN
>> or Radius searches.
>>
>> I don't know if it's realistic, but I've profiled the following
>> bench_plot_approximate_neighbors settings:
>>
>> Building NearestNeighbors for 100000 samples in 100 dimensions
>> LSHF parameters: n_estimators = 15, n_candidates = 100
>> Building LSHForest for 100000 samples in 100 dimensions
>> Done in 1.492s
>> Average time for lshf neighbor queries: 0.005s
>> Average time for exact neighbor queries: 0.002s
>> Average Accuracy : 0.88
>> Speed up: 0.5x
>>
>> Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
>> matrix, we have:
>>
>> - 0.03 spent in _query (hashing and descending)
>> - 0.91 spent in _compute_distances (exact distance calculation)
>> - 3.80 remaining in _get_candidates (ascending phase), almost all of
>>   which is spent in _find_matching_indices
>>
>> Cutting exact distance calculation to 0s will still not get this
>> faster than the exact approach. Of course, your mileage may vary, but
>> this suggests to me you're barking up the wrong tree (no pun intended).
>>
>> On 30 July 2015 at 19:43, Maheshakya Wijewardena
>> <pmaheshak...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've started to look into the matter of improving performance of
>>> LSHForest.
>>> As we have discussed some time ago (in fact, quite a long time ago),
>>> the main concern is to Cythonize distance calculations. Currently,
>>> this is done by iterating over all the query vectors when the
>>> `kneighbors` method is called for a set of query vectors. It has been
>>> identified that iterating over each query with Python loops is a huge
>>> overhead. I have implemented a few Cython hacks to demonstrate the
>>> distance calculation in LSHForest, and I was able to get an
>>> approximate 10x speedup compared to the current distance calculation
>>> with a Python loop. However, I came across some blockers while trying
>>> to do this and need some clarifications.
>>>
>>> What I need to know is: do we use a mechanism to release the GIL when
>>> we want to parallelize? One of my observations is that
>>> `pairwise_distances` uses all the cores even when I don't specify the
>>> `n_jobs` parameter, which defaults to 1. Is this an expected behavior?
>>>
>>> If I want to release the GIL, can I use the OpenMP module in Cython?
>>> Or is that a task for Joblib?
>>>
>>> Any input on this is highly appreciated.
>>>
>>> Best regards,
>>> --
>>> Maheshakya Wijewardena,
>>> Undergraduate,
>>> Department of Computer Science and Engineering,
>>> Faculty of Engineering,
>>> University of Moratuwa,
>>> Sri Lanka
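For reference, here is a rough sketch of how the timing breakdown quoted above can be reproduced with cProfile, roughly following the quoted bench_plot_approximate_neighbors settings; random dense data stands in for whatever the benchmark script actually uses:

    import cProfile
    import pstats
    import numpy as np
    from sklearn.neighbors import LSHForest

    # 100000 x 100 dense index points and a 1000-query matrix, matching the
    # quoted benchmark dimensions (synthetic data, so absolute timings and
    # accuracy will differ from the quoted run)
    rng = np.random.RandomState(42)
    X = rng.rand(100000, 100)
    queries = rng.rand(1000, 100)

    lshf = LSHForest(n_estimators=15, n_candidates=100,
                     random_state=42).fit(X)

    profiler = cProfile.Profile()
    profiler.enable()
    lshf.kneighbors(queries, n_neighbors=10)
    profiler.disable()

    # show where the time goes: _query, _compute_distances,
    # _get_candidates / _find_matching_indices, etc.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)

Sorting by cumulative time makes it easy to check whether _find_matching_indices still dominates once the distance calculation is sped up.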