It's nice to see some decent speed-up factors, though the accuracy trade-off is still not great. In any case, I'd like to see the code and where we can go from here. Great work so far!
On 7 August 2015 at 07:50, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:

> I did a rough implementation, setting b = min_hash_match. The result I got
> from running the benchmark is attached. It was able to roughly triple the
> speed-up of the kneighbors function for large index sizes. However, this
> implementation adds some overhead to fitting time, as there are 2**b *
> n_estimators numpy searchsorted calls during training. But that will most
> probably be compensated for as the number of queries grows, since 2**b *
> n_estimators is a constant.
>
> I'll send a PR with proper refactoring.
>
> On Sun, Aug 2, 2015 at 6:41 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> Thanks, I look forward to this being improved, while I have little
>> availability to help myself atm.
>>
>> On 2 August 2015 at 22:58, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:
>>
>>> I agree with Joel. Profiling indicated that 69.8% of the total time of
>>> kneighbors is spent on _find_matching_indices and 22.9% is spent on
>>> _compute_distances. So I'll give priority to working on
>>> _find_matching_indices with the method you suggested.
>>>
>>> On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:
>>>
>>>> Hi Joel,
>>>> I was on vacation during the past 3 days. I'll look into this asap and
>>>> let you all know.
>>>>
>>>> I also did some profiling, but only with the usage of the
>>>> `pairwise_distance` method. The brute force technique directly uses
>>>> that for the entire query array, but LSH uses it in a loop, and I
>>>> noticed there is a huge lag. I'll first confirm your claims. I can
>>>> start working on this, but I think I'll need your or some other
>>>> contributors' reviewing as well. I'll do this if it's possible.
>>>>
>>>> On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>
>>>>> @Maheshakya, will you be able to do work in the near future on
>>>>> speeding up the ascending phase instead?
>>>>> Or should another contributor take that baton? Not only does it seem
>>>>> to be a major contributor to runtime, but it is independent of metric
>>>>> and hashing mechanism (within binary hashes), and hence the most
>>>>> fundamental component of LSHForest.
>>>>>
>>>>> On 30 July 2015 at 22:28, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>
>>>>>> (Sorry, I should have said the first b layers, not 2**b layers,
>>>>>> producing a memoization of 2**b offsets.)
>>>>>>
>>>>>> On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>>
>>>>>>> One approach to fixing the ascending phase would ensure that
>>>>>>> _find_matching_indices is only searching over parts of the tree that
>>>>>>> have not yet been explored, while currently it searches over the
>>>>>>> entire index at each depth.
>>>>>>>
>>>>>>> My preferred, but more experimental, solution is to memoize where
>>>>>>> the first 2**b layers of the tree begin and end in the index, for
>>>>>>> small b. So if our index stored:
>>>>>>>
>>>>>>> [[0, 0, 0, 1, 1, 0, 0, 0],
>>>>>>>  [0, 0, 1, 0, 1, 0, 1, 0],
>>>>>>>  [0, 0, 1, 0, 1, 0, 1, 0],
>>>>>>>  [0, 1, 0, 0, 0, 0, 0, 0],
>>>>>>>  [0, 1, 0, 1, 1, 0, 0, 0],
>>>>>>>  [0, 1, 1, 0, 0, 1, 1, 0],
>>>>>>>  [1, 0, 0, 0, 0, 1, 0, 1],
>>>>>>>  [1, 0, 0, 1, 0, 1, 0, 1],
>>>>>>>  [1, 1, 0, 0, 0, 0, 0, 0],
>>>>>>>  [1, 1, 1, 1, 1, 0, 0, 0]]
>>>>>>>
>>>>>>> and b=2, we'd memoize offsets for prefixes of size 2:
>>>>>>>
>>>>>>> [0,  # 00
>>>>>>>  3,  # 01
>>>>>>>  6,  # 10
>>>>>>>  8]  # 11
>>>>>>>
>>>>>>> Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down
>>>>>>> to leave the first b bits [0, 1] remaining, and look them up in the
>>>>>>> array just defined to identify a much narrower search space [3, 6)
>>>>>>> matching that prefix in the overall index.
>>>>>>>
>>>>>>> Indeed, given the min_hash_match constraint, not having this sort of
>>>>>>> thing for b >= min_hash_match seems wasteful.
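[Editor's note: a minimal numpy sketch of the offset memoization described above, using the example index and b=2 from the message. The helpers `pack` and `build_prefix_offsets` are hypothetical names, and this illustrates the idea only; it is not LSHForest's actual code.]

```python
import numpy as np

# The sorted binary hash index from the example above, as 8-bit rows.
index = np.array([
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
])

HASH_BITS = 8

def pack(bits):
    """Pack a row of bits into a single integer (MSB first)."""
    return bits @ (1 << np.arange(len(bits) - 1, -1, -1))

def build_prefix_offsets(packed_index, b):
    """Memoize, for each of the 2**b possible b-bit prefixes, where rows
    with that prefix begin in the sorted packed index."""
    prefixes = np.arange(2 ** b) << (HASH_BITS - b)
    return np.searchsorted(packed_index, prefixes)

packed = np.sort([pack(row) for row in index])
offsets = build_prefix_offsets(packed, b=2)
print(offsets)  # -> [0 3 6 8], matching the offsets in the message

# Query [0, 1, 1, 0, 0, 0, 0, 0]: shift down to keep only the first
# b=2 bits, then look up the memoized offsets to narrow the search.
query = pack(np.array([0, 1, 1, 0, 0, 0, 0, 0]))
prefix = query >> (HASH_BITS - 2)   # binary 01 -> 1
lo = offsets[prefix]
hi = offsets[prefix + 1] if prefix + 1 < len(offsets) else len(packed)
print(lo, hi)  # -> 3 6: the narrowed window [3, 6)
```

Subsequent searchsorted calls would then operate on `packed[lo:hi]` instead of the whole index, which is where the log(n / 2**b) cost mentioned below comes from.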
>>>>>>> This provides us O(1) access to the top layers of the tree when
>>>>>>> ascending, and makes the searchsorted calls run in log(n / (2 ** b))
>>>>>>> time rather than log(n). It is also much more like traditional LSH.
>>>>>>> However, it complexifies the code, as we now have to consider two
>>>>>>> strategies for descent/ascent.
>>>>>>>
>>>>>>> On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>>>
>>>>>>>> What makes you think this is the main bottleneck? While it is not
>>>>>>>> an insignificant consumer of time, I really doubt this is what's
>>>>>>>> making scikit-learn's LSH implementation severely underperform with
>>>>>>>> respect to other implementations.
>>>>>>>>
>>>>>>>> We need to profile. In order to do that, we need some sensible
>>>>>>>> parameters that users might actually want, e.g. number of features
>>>>>>>> for {dense, sparse} cases, index size, target 10NN precision and
>>>>>>>> recall (selecting corresponding n_estimators and n_candidates).
>>>>>>>> Ideally we'd consider real-world datasets. And of course, these
>>>>>>>> should be sensible for whichever metric we're operating over, and
>>>>>>>> whether we're doing kNN or radius searches.
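[Editor's note: to make the "we need to profile" step concrete, here is a minimal sketch of gathering such a per-function breakdown with the stdlib cProfile/pstats modules. The `kneighbors` and helper functions below are toy stand-ins, not the real LSHForest code; in practice one would profile the fitted estimator's kneighbors call on a query matrix.]

```python
import cProfile
import io
import pstats

# Toy stand-ins for the phases being measured in the real profile.
def _find_matching_indices(n):
    return sum(range(n))            # stand-in for the ascending phase

def _compute_distances(n):
    return sum(i * i for i in range(n))  # stand-in for exact distances

def kneighbors(n_queries):
    for _ in range(n_queries):
        _find_matching_indices(2000)
        _compute_distances(500)

profiler = cProfile.Profile()
profiler.enable()
kneighbors(1000)                    # a 1000-query batch, as in the thread
profiler.disable()

# Sort by cumulative time to see which helper dominates the total.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

Run against the real estimator, the same report would show how total kneighbors time splits across _query, _compute_distances, and _find_matching_indices, as in the numbers below.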
>>>>>>>> I don't know if it's realistic, but I've profiled the following
>>>>>>>> bench_plot_approximate_neighbors settings:
>>>>>>>>
>>>>>>>>     Building NearestNeighbors for 100000 samples in 100 dimensions
>>>>>>>>     LSHF parameters: n_estimators = 15, n_candidates = 100
>>>>>>>>     Building LSHForest for 100000 samples in 100 dimensions
>>>>>>>>     Done in 1.492s
>>>>>>>>     Average time for lshf neighbor queries: 0.005s
>>>>>>>>     Average time for exact neighbor queries: 0.002s
>>>>>>>>     Average Accuracy : 0.88
>>>>>>>>     Speed up: 0.5x
>>>>>>>>
>>>>>>>> Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
>>>>>>>> matrix, we have:
>>>>>>>>
>>>>>>>> - 0.03s spent in _query (hashing and descending)
>>>>>>>> - 0.91s spent in _compute_distances (exact distance calculation)
>>>>>>>> - 3.80s remaining in _get_candidates (ascending phase), almost all
>>>>>>>>   of which is spent in _find_matching_indices
>>>>>>>>
>>>>>>>> Cutting exact distance calculation to 0s will still not get this
>>>>>>>> faster than the exact approach. Of course, your mileage may vary,
>>>>>>>> but this suggests to me you're barking up the wrong tree (no pun
>>>>>>>> intended).
>>>>>>>>
>>>>>>>> On 30 July 2015 at 19:43, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've started to look into the matter of improving the performance
>>>>>>>>> of LSHForest. As we have discussed some time before (in fact,
>>>>>>>>> quite a long time ago), the main concern is to Cythonize distance
>>>>>>>>> calculations. Currently, this is done by iteratively moving over
>>>>>>>>> all the query vectors when the `kneighbors` method is called for a
>>>>>>>>> set of query vectors. It has been identified that iterating over
>>>>>>>>> each query with Python loops is a huge overhead.
>>>>>>>>> I have implemented a few Cython hacks to demonstrate the distance
>>>>>>>>> calculation in LSHForest, and I was able to get an approximate 10x
>>>>>>>>> speedup compared to the current distance calculation with a Python
>>>>>>>>> loop. However, I came across some blockers while trying to do this
>>>>>>>>> and need some clarifications.
>>>>>>>>>
>>>>>>>>> What I need to know is: do we use a mechanism to release the GIL
>>>>>>>>> when we want to parallelize? One of my observations is that
>>>>>>>>> `pairwise_distance` uses all the cores even when I don't specify
>>>>>>>>> the `n_jobs` parameter, which defaults to 1. Is this expected
>>>>>>>>> behavior?
>>>>>>>>>
>>>>>>>>> If I want to release the GIL, can I use the OpenMP module in
>>>>>>>>> Cython? Or is that a task for joblib?
>>>>>>>>> Any input on this is highly appreciated.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> --
>>>>>>>>> Maheshakya Wijewardena, Undergraduate,
>>>>>>>>> Department of Computer Science and Engineering,
>>>>>>>>> Faculty of Engineering,
>>>>>>>>> University of Moratuwa,
>>>>>>>>> Sri Lanka
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>> _______________________________________________
>>>>>>>>> Scikit-learn-general mailing list
>>>>>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
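[Editor's note: on the per-query Python loop discussed above: before reaching for Cython, a batched numpy computation already removes the loop overhead that profiling identified. A hedged sketch of the pattern, using plain Euclidean distance; this is illustrative only, not scikit-learn's pairwise_distances internals.]

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 16)        # indexed points
queries = rng.rand(50, 16)   # query points

# Per-query Python loop: one distance computation per query vector,
# the pattern identified in the thread as the overhead in LSH queries.
loop_dists = np.empty((len(queries), len(X)))
for i, q in enumerate(queries):
    loop_dists[i] = np.sqrt(((X - q) ** 2).sum(axis=1))

# Single batched computation over the whole query matrix, the way the
# brute-force path hands everything to the distance routine at once.
diff = queries[:, None, :] - X[None, :, :]
batch_dists = np.sqrt((diff ** 2).sum(axis=2))

print(np.allclose(loop_dists, batch_dists))  # -> True
```

The two produce identical results; the batched form simply moves the loop from the interpreter into compiled numpy code, which is the same effect the Cython rewrite is after.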