Thanks, I look forward to this being improved, though I have little availability to help myself at the moment.
On 2 August 2015 at 22:58, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:

I agree with Joel. Profiling indicated that 69.8% of the total time of kneighbors is spent in _find_matching_indices and 22.9% in _compute_distances, so I'll give priority to working on _find_matching_indices with the method you suggested.

On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:

Hi Joel,

I was on vacation for the past three days. I'll look into this as soon as possible and let you all know.

I also did some profiling, but only of the use of the `pairwise_distances` function. The brute-force technique calls it directly on the entire query array, whereas LSH calls it inside a Python loop, and I noticed a huge lag there. I'll first confirm your claims. I can start working on this, but I think I'll need reviews from you or other contributors as well. I'll do this if possible.

On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman <joel.noth...@gmail.com> wrote:

@Maheshakya, will you be able to work in the near future on speeding up the ascending phase instead, or should another contributor take that baton? Not only does it seem to be a major contributor to runtime, but it is independent of the metric and hashing mechanism (within binary hashes), and hence the most fundamental component of LSHForest.

On 30 July 2015 at 22:28, Joel Nothman <joel.noth...@gmail.com> wrote:

(Sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets.)

On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com> wrote:

One approach to fixing the ascending phase would be to ensure that _find_matching_indices searches only over the parts of the tree that have not yet been explored; currently it searches over the entire index at each depth.

My preferred, but more experimental, solution is to memoize where the first 2**b layers of the tree begin and end in the index, for small b. So if our index stored:

    [[0, 0, 0, 1, 1, 0, 0, 0],
     [0, 0, 1, 0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1, 0, 1, 0],
     [0, 1, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 1, 1, 0, 0, 0],
     [0, 1, 1, 0, 0, 1, 1, 0],
     [1, 0, 0, 0, 0, 1, 0, 1],
     [1, 0, 0, 1, 0, 1, 0, 1],
     [1, 1, 0, 0, 0, 0, 0, 0],
     [1, 1, 1, 1, 1, 0, 0, 0]]

and b=2, we'd memoize offsets for prefixes of size 2:

    [0,  # 00
     3,  # 01
     6,  # 10
     8]  # 11

Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down so that only the first b bits [0, 1] remain, and look them up in the array just defined to identify a much narrower search space, [3, 6), matching that prefix in the overall index.

Indeed, given the min_hash_match constraint, not having this sort of thing for b >= min_hash_match seems wasteful.

This gives us O(1) access to the top layers of the tree when ascending, and makes the searchsorted calls run in log(n / 2**b) time rather than log(n). It is also much more like traditional LSH. However, it complicates the code, as we now have to consider two strategies for descent/ascent.
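To make the memoization concrete, here is a minimal sketch in NumPy, assuming each tree stores its hashes as a sorted 1-D integer array; the helper names are hypothetical and not part of LSHForest:

    import numpy as np

    def build_prefix_offsets(sorted_hashes, hash_size, b):
        # For each of the 2**b possible b-bit prefixes, record where its
        # block of hashes begins in the sorted index (built once per tree).
        prefixes = np.arange(2 ** b) << (hash_size - b)
        return np.searchsorted(sorted_hashes, prefixes)

    def prefix_bounds(offsets, n_total, query_hash, hash_size, b):
        # [start, end) of the index entries sharing the query's b-bit prefix.
        prefix = query_hash >> (hash_size - b)
        start = offsets[prefix]
        end = offsets[prefix + 1] if prefix + 1 < len(offsets) else n_total
        return start, end

    # The 10 x 8-bit index from the example above, rows packed as integers:
    index = np.array([0b00011000, 0b00101010, 0b00101010, 0b01000000,
                      0b01011000, 0b01100110, 0b10000101, 0b10010101,
                      0b11000000, 0b11111000])
    offsets = build_prefix_offsets(index, hash_size=8, b=2)
    print(offsets)                                              # [0 3 6 8]
    print(prefix_bounds(offsets, len(index), 0b01100000, 8, 2)) # (3, 6)

With the offsets built once at fit time, each query pays only an O(1) lookup to restrict every subsequent searchsorted call to its prefix's block.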
On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com> wrote:

What makes you think this is the main bottleneck? While it is not an insignificant consumer of time, I really doubt this is what's making scikit-learn's LSH implementation severely underperform with respect to other implementations.

We need to profile. In order to do that, we need some sensible parameters that users might actually want, e.g. the number of features for the {dense, sparse} cases, the index size, and a target 10-NN precision and recall (selecting corresponding n_estimators and n_candidates). Ideally we'd consider real-world datasets. And of course, these should be sensible for whichever metric we're operating over, and for whether we're doing k-NN or radius searches.

I don't know if it's realistic, but I've profiled the following bench_plot_approximate_neighbors settings:

    Building NearestNeighbors for 100000 samples in 100 dimensions
    LSHF parameters: n_estimators = 15, n_candidates = 100
    Building LSHForest for 100000 samples in 100 dimensions
    Done in 1.492s
    Average time for lshf neighbor queries: 0.005s
    Average time for exact neighbor queries: 0.002s
    Average Accuracy : 0.88
    Speed up: 0.5x

Of the 4.77s total spent in LSHForest.kneighbors for a 1000-query matrix, we have:

- 0.03s spent in _query (hashing and descending);
- 0.91s spent in _compute_distances (exact distance calculation);
- 3.80s remaining in _get_candidates (the ascending phase), almost all of which is spent in _find_matching_indices.

Even cutting the exact distance calculation to 0s would still not make this faster than the exact approach. Of course, your mileage may vary, but this suggests to me you're barking up the wrong tree (no pun intended).
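For anyone wanting to reproduce this kind of breakdown, here is a sketch of the profiling setup using cProfile; the random data are purely illustrative, standing in for the bench_plot_approximate_neighbors dataset:

    import cProfile
    import pstats
    import numpy as np
    from sklearn.neighbors import LSHForest

    rng = np.random.RandomState(42)
    X = rng.randn(100000, 100)        # 100000 samples in 100 dimensions
    queries = rng.randn(1000, 100)    # 1000-query matrix

    lshf = LSHForest(n_estimators=15, n_candidates=100,
                     random_state=42).fit(X)

    profiler = cProfile.Profile()
    profiler.enable()
    lshf.kneighbors(queries, n_neighbors=10)
    profiler.disable()

    # Sort by cumulative time to see how kneighbors splits across
    # _query, _compute_distances and _get_candidates/_find_matching_indices.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)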
On 30 July 2015 at 19:43, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:

Hi,

I've started to look into improving the performance of LSHForest. As we discussed some time ago (quite a long time, in fact), the main concern is to Cythonize the distance calculations. Currently, this is done by iterating over all the query vectors in Python when the `kneighbors` method is called for a set of queries, and that per-query Python loop has been identified as a huge overhead. I have implemented a few Cython hacks to demonstrate the distance calculation in LSHForest, and I was able to get roughly a 10x speedup over the current distance calculation with a Python loop. However, I came across some blockers while trying to do this and need some clarifications.

What I need to know is: do we have a mechanism to release the GIL when we want to parallelize? One of my observations is that `pairwise_distances` uses all the cores even when I don't specify the `n_jobs` parameter, which defaults to 1. Is this expected behavior?

If I want to release the GIL, can I use the OpenMP module in Cython? Or is that a task for joblib?

Any input on this is highly appreciated.

Best regards,
--
Maheshakya Wijewardena
Undergraduate, Department of Computer Science and Engineering,
Faculty of Engineering, University of Moratuwa, Sri Lanka
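On the OpenMP question, for reference: Cython can release the GIL and parallelize a loop with cython.parallel.prange. A minimal sketch of a GIL-free distance kernel in a standalone .pyx file (illustrative only, not scikit-learn's actual code; the extension must be compiled and linked with -fopenmp):

    from cython.parallel import prange

    cdef double _row_sqdist(double[:, ::1] X, double[::1] q,
                            Py_ssize_t i) nogil:
        # Squared Euclidean distance from row i of X to q.
        cdef Py_ssize_t j
        cdef double acc = 0.0, diff
        for j in range(X.shape[1]):
            diff = X[i, j] - q[j]
            acc += diff * diff
        return acc

    def squared_euclidean(double[:, ::1] X, double[::1] q, double[::1] out):
        cdef Py_ssize_t i
        # prange releases the GIL and spreads rows across OpenMP threads.
        for i in prange(X.shape[0], nogil=True):
            out[i] = _row_sqdist(X, q, i)

As for pairwise_distances using all cores with n_jobs=1: for the euclidean metric it reduces to np.dot, so a multithreaded BLAS (OpenBLAS, MKL) will use several cores regardless of n_jobs; that is most likely what was observed.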
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general