It's nice to see some decent speed-up factors, though the accuracy tradeoff
is still not so great. Still, I'd like to see the code and where we can go
from here. Great work so far!

On 7 August 2015 at 07:50, Maheshakya Wijewardena <pmaheshak...@gmail.com>
wrote:

> I did a rough implementation, setting b = min_hash_match. The result I got
> from running the benchmark is attached. It roughly tripled the speed-up of
> the kneighbors function for large index sizes. However, this implementation
> adds some overhead to fitting time, since there are 2**b * n_estimators
> numpy searchsorted calls during training. That cost should be amortized as
> the number of queries grows, since 2**b * n_estimators is a constant.
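> As a hypothetical sketch of that precomputation (the names below are mine,
> not the actual implementation): the 2**b boundary lookups per tree can be
> batched into a single vectorized searchsorted call per estimator, which
> should keep the constant fitting overhead small.

```python
import numpy as np

def memoize_tree_offsets(sorted_tree_hashes, hash_bits, b):
    """One vectorized searchsorted per tree finds all 2**b prefix
    boundaries at once, instead of 2**b separate calls."""
    # Smallest packed hash carrying each b-bit prefix.
    boundaries = np.arange(2 ** b, dtype=np.int64) << (hash_bits - b)
    return np.searchsorted(sorted_tree_hashes, boundaries, side='left')

# Example: three trees of 8-bit hashes, b = 2.
trees = [np.sort(np.random.default_rng(seed).integers(0, 256, size=1000))
         for seed in range(3)]
all_offsets = [memoize_tree_offsets(t, hash_bits=8, b=2) for t in trees]
```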
>
> I'll send a PR with proper refactoring.
>
>
> On Sun, Aug 2, 2015 at 6:41 PM, Joel Nothman <joel.noth...@gmail.com>
> wrote:
>
>> Thanks, I look forward to this being improved, while I have little
>> availability to help myself atm.
>>
>> On 2 August 2015 at 22:58, Maheshakya Wijewardena <pmaheshak...@gmail.com
>> > wrote:
>>
>>> I agree with Joel. Profiling indicated that 69.8% of the total time of
>>> kneighbors is spent in _find_matching_indices and 22.9% in
>>> _compute_distances, so I'll prioritize working on _find_matching_indices
>>> with the method you suggested.
>>>
>>> On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena <
>>> pmaheshak...@gmail.com> wrote:
>>>
>>>> Hi Joel,
>>>> I was on vacation during the past 3 days. I'll look into this asap and let
>>>> you all know.
>>>>
>>>> I also did some profiling, but only of the `pairwise_distances` method.
>>>> The brute-force technique calls it once on the entire query array, but
>>>> LSH calls it in a loop, and I noticed a huge lag there. I'll first
>>>> confirm your claims. I can start working on this, but I'll need reviews
>>>> from you or other contributors as well. I'll do this if it's possible.
>>>>
>>>> On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman <joel.noth...@gmail.com>
>>>> wrote:
>>>>
>>>>> @Maheshakya, will you be able to do work in the near future on
>>>>> speeding up the ascending phase instead? Or should another contributor
>>>>> take that baton? Not only does it seem to be a major contributor to
>>>>> runtime, but it is independent of metric and hashing mechanism (within
>>>>> binary hashes), and hence the most fundamental component of LSHForest.
>>>>>
>>>>> On 30 July 2015 at 22:28, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>
>>>>>> (sorry, I should have said the first b layers, not 2**b layers,
>>>>>> producing a memoization of 2**b offsets)
>>>>>>
>>>>>> On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> One approach to fixing the ascending phase would be to ensure that
>>>>>>> _find_matching_indices only searches over parts of the tree that have
>>>>>>> not yet been explored; currently, it searches over the entire index
>>>>>>> at each depth.
>>>>>>>
>>>>>>> My preferred, but more experimental, solution is to memoize where
>>>>>>> the first 2**b layers of the tree begin and end in the index, for
>>>>>>> small b. So if our index stored:
>>>>>>> [[0, 0, 0, 1, 1, 0, 0, 0],
>>>>>>>  [0, 0, 1, 0, 1, 0, 1, 0],
>>>>>>>  [0, 0, 1, 0, 1, 0, 1, 0],
>>>>>>>  [0, 1, 0, 0, 0, 0, 0, 0],
>>>>>>>  [0, 1, 0, 1, 1, 0, 0, 0],
>>>>>>>  [0, 1, 1, 0, 0, 1, 1, 0],
>>>>>>>  [1, 0, 0, 0, 0, 1, 0, 1],
>>>>>>>  [1, 0, 0, 1, 0, 1, 0, 1],
>>>>>>>  [1, 1, 0, 0, 0, 0, 0, 0],
>>>>>>>  [1, 1, 1, 1, 1, 0, 0, 0]]
>>>>>>> and b=2, we'd memoize offsets for prefixes of size 2:
>>>>>>> [0, # 00
>>>>>>>  3, # 01
>>>>>>>  6, # 10
>>>>>>>  8, # 11
>>>>>>> ]
>>>>>>>
>>>>>>> Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift
>>>>>>> down to leave the first b bits [0, 1] remaining, and look them up in
>>>>>>> the array just defined to identify a much narrower search space
>>>>>>> [3, 6) matching that prefix in the overall index.
>>>>>>>
>>>>>>> Indeed, given the min_hash_match constraint, not having this sort of
>>>>>>> thing for b >= min_hash_match seems wasteful.
>>>>>>>
>>>>>>> This gives us O(1) access to the top layers of the tree when
>>>>>>> ascending, and makes the searchsorted calls run in log(n / (2 ** b))
>>>>>>> time rather than log(n). It is also much more like traditional LSH.
>>>>>>> However, it complicates the code, as we now have to consider two
>>>>>>> strategies for descent/ascent.
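>>>>>>> A minimal numpy sketch of the idea (the packed-integer hash
>>>>>>> representation and the function names here are my assumptions, not
>>>>>>> the actual LSHForest internals):

```python
import numpy as np

def build_prefix_offsets(sorted_hashes, hash_bits, b):
    """Memoize where each of the 2**b prefix buckets begins in the index."""
    # Smallest packed hash carrying each b-bit prefix.
    boundaries = np.arange(2 ** b, dtype=np.int64) << (hash_bits - b)
    return np.searchsorted(sorted_hashes, boundaries, side='left')

def prefix_range(offsets, query_hash, hash_bits, b, n_total):
    """O(1) lookup of the [start, stop) slice sharing the query's prefix."""
    prefix = query_hash >> (hash_bits - b)
    stop = offsets[prefix + 1] if prefix + 1 < len(offsets) else n_total
    return int(offsets[prefix]), int(stop)

# The 10-row example index above, packed into 8-bit integers:
index = np.array([0b00011000, 0b00101010, 0b00101010, 0b01000000,
                  0b01011000, 0b01100110, 0b10000101, 0b10010101,
                  0b11000000, 0b11111000])
offsets = build_prefix_offsets(index, hash_bits=8, b=2)
print(offsets)                                              # [0 3 6 8]
print(prefix_range(offsets, 0b01100000, 8, 2, len(index)))  # (3, 6)
```

>>>>>>> Subsequent searchsorted calls then operate only within that
>>>>>>> [start, stop) slice, giving the log(n / (2 ** b)) behaviour.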
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> What makes you think this is the main bottleneck? While it is not
>>>>>>>> an insignificant consumer of time, I really doubt this is what's making
>>>>>>>> scikit-learn's LSH implementation severely underperform with respect to
>>>>>>>> other implementations.
>>>>>>>>
>>>>>>>> We need to profile. In order to do that, we need some sensible
>>>>>>>> parameters that users might actually want, e.g. number of features for
>>>>>>>> {dense, sparse} cases, index size, target 10NN precision and recall
>>>>>>>> (selecting corresponding n_estimators and n_candidates). Ideally we'd
>>>>>>>> consider real-world datasets. And of course, these should be sensible 
>>>>>>>> for
>>>>>>>> whichever metric we're operating over, and whether we're doing KNN or
>>>>>>>> Radius searches.
>>>>>>>>
>>>>>>>> I don't know if it's realistic, but I've profiled the following
>>>>>>>> bench_plot_approximate_neighbors settings:
>>>>>>>>
>>>>>>>> Building NearestNeighbors for 100000 samples in 100 dimensions
>>>>>>>> LSHF parameters: n_estimators = 15, n_candidates = 100
>>>>>>>> Building LSHForest for 100000 samples in 100 dimensions
>>>>>>>> Done in 1.492s
>>>>>>>> Average time for lshf neighbor queries: 0.005s
>>>>>>>> Average time for exact neighbor queries: 0.002s
>>>>>>>> Average Accuracy : 0.88
>>>>>>>> Speed up: 0.5x
>>>>>>>>
>>>>>>>> Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
>>>>>>>> matrix, we have:
>>>>>>>>
>>>>>>>>    - 0.03 spent in _query (hashing and descending)
>>>>>>>>    - 0.91 spent in _compute_distances (exact distance calculation)
>>>>>>>>    - 3.80 remaining in _get_candidates (ascending phase), almost
>>>>>>>>    all of which is spent in _find_matching_indices
>>>>>>>>
>>>>>>>> Cutting exact distance calculation to 0s will still not get this
>>>>>>>> faster than the exact approach. Of course, your mileage may vary, but 
>>>>>>>> this
>>>>>>>> suggests to me you're barking up the wrong tree (no pun intended).
>>>>>>>>
>>>>>>>> On 30 July 2015 at 19:43, Maheshakya Wijewardena <
>>>>>>>> pmaheshak...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've started to look into the matter of improving the performance
>>>>>>>>> of LSHForest. As we discussed some time ago (quite a long time, in
>>>>>>>>> fact), the main concern is to Cythonize the distance calculations.
>>>>>>>>> Currently, these are done by iterating over all the query vectors
>>>>>>>>> when the `kneighbors` method is called for a set of query vectors.
>>>>>>>>> Iterating over each query with Python loops has been identified as
>>>>>>>>> a huge overhead. I have implemented a few Cython hacks to
>>>>>>>>> demonstrate the distance calculation in LSHForest and was able to
>>>>>>>>> get an approximate 10x speedup compared to the current distance
>>>>>>>>> calculation with a Python loop. However, I came across some
>>>>>>>>> blockers while trying to do this and need some clarifications.
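>>>>>>>>> A minimal numpy sketch of the looped versus vectorized forms
>>>>>>>>> (illustrative only; in LSHForest each query has its own candidate
>>>>>>>>> set, which is why the per-query loop arises in the first place):

```python
import numpy as np

def distances_looped(queries, index):
    # One distance computation per query: Python-loop overhead on each call.
    return np.array([np.sqrt(((index - q) ** 2).sum(axis=1))
                     for q in queries])

def distances_vectorized(queries, index):
    # Single broadcasted computation over the whole query array.
    diff = queries[:, np.newaxis, :] - index[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

rng = np.random.default_rng(0)
Q, X = rng.random((5, 3)), rng.random((20, 3))
assert np.allclose(distances_looped(Q, X), distances_vectorized(Q, X))
```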
>>>>>>>>>
>>>>>>>>> What I need to know is: do we use a mechanism to release the GIL
>>>>>>>>> when we want to parallelize? One of my observations is that
>>>>>>>>> `pairwise_distances` uses all the cores even when I don't specify
>>>>>>>>> the `n_jobs` parameter, which defaults to 1. Is this expected
>>>>>>>>> behavior?
>>>>>>>>>
>>>>>>>>> If I want to release the GIL, can I use the OpenMP module in
>>>>>>>>> Cython? Or is that a task for joblib?
>>>>>>>>> Any input on this is highly appreciated.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> *Maheshakya Wijewardena, Undergraduate,*
>>>>>>>>> *Department of Computer Science and Engineering,*
>>>>>>>>> *Faculty of Engineering.*
>>>>>>>>> *University of Moratuwa,*
>>>>>>>>> *Sri Lanka*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Scikit-learn-general mailing list
>>>>>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>