Thanks, I look forward to this being improved, though I have little availability to help myself at the moment.
On 2 August 2015 at 22:58, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:

I agree with Joel. Profiling indicated that 69.8% of the total time of kneighbors is spent in _find_matching_indices and 22.9% in _compute_distances, so I'll give priority to working on _find_matching_indices with the method you suggested.

On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:

Hi Joel,

I was on vacation for the past three days. I'll look into this as soon as possible and let you all know.

I also did some profiling, but only of the use of the `pairwise_distances` function. The brute-force technique calls it directly on the entire query array, whereas LSH calls it inside a Python loop, and I noticed a huge lag there. I'll first confirm your claims. I can start working on this, but I think I'll need reviews from you or other contributors as well. I'll do this if possible.

On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman <joel.noth...@gmail.com> wrote:

@Maheshakya, will you be able to work in the near future on speeding up the ascending phase instead, or should another contributor take that baton? Not only does it seem to be a major contributor to runtime, but it is independent of the metric and hashing mechanism (within binary hashes), and hence the most fundamental component of LSHForest.

On 30 July 2015 at 22:28, Joel Nothman <joel.noth...@gmail.com> wrote:

(Sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets.)

On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com> wrote:

One approach to fixing the ascending phase would be to ensure that _find_matching_indices searches only over the parts of the tree that have not yet been explored; currently it searches over the entire index at each depth.

My preferred, but more experimental, solution is to memoize where the first 2**b layers of the tree begin and end in the index, for small b. So if our index stored:

    [[0, 0, 0, 1, 1, 0, 0, 0],
     [0, 0, 1, 0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1, 0, 1, 0],
     [0, 1, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 1, 1, 0, 0, 0],
     [0, 1, 1, 0, 0, 1, 1, 0],
     [1, 0, 0, 0, 0, 1, 0, 1],
     [1, 0, 0, 1, 0, 1, 0, 1],
     [1, 1, 0, 0, 0, 0, 0, 0],
     [1, 1, 1, 1, 1, 0, 0, 0]]

and b=2, we'd memoize offsets for prefixes of size 2:

    [0,  # 00
     3,  # 01
     6,  # 10
     8]  # 11

Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down so that only the first b bits [0, 1] remain, and look them up in the array just defined to identify a much narrower search space, [3, 6), matching that prefix in the overall index.

Indeed, given the min_hash_match constraint, not having this sort of thing for b >= min_hash_match seems wasteful.

This gives us O(1) access to the top layers of the tree when ascending, and makes the searchsorted calls run in log(n / 2**b) time rather than log(n). It is also much more like traditional LSH. However, it complicates the code, as we now have to consider two strategies for descent/ascent.
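To make the memoization concrete, here is a minimal sketch in NumPy, assuming each tree stores its hashes as a sorted 1-D integer array; the helper names are hypothetical and not part of LSHForest:

    import numpy as np

    def build_prefix_offsets(sorted_hashes, hash_size, b):
        # For each of the 2**b possible b-bit prefixes, record where its
        # block of hashes begins in the sorted index (built once per tree).
        prefixes = np.arange(2 ** b) << (hash_size - b)
        return np.searchsorted(sorted_hashes, prefixes)

    def prefix_bounds(offsets, n_total, query_hash, hash_size, b):
        # [start, end) of the index entries sharing the query's b-bit prefix.
        prefix = query_hash >> (hash_size - b)
        start = offsets[prefix]
        end = offsets[prefix + 1] if prefix + 1 < len(offsets) else n_total
        return start, end

    # The 10 x 8-bit index from the example above, rows packed as integers:
    index = np.array([0b00011000, 0b00101010, 0b00101010, 0b01000000,
                      0b01011000, 0b01100110, 0b10000101, 0b10010101,
                      0b11000000, 0b11111000])
    offsets = build_prefix_offsets(index, hash_size=8, b=2)
    print(offsets)                                              # [0 3 6 8]
    print(prefix_bounds(offsets, len(index), 0b01100000, 8, 2)) # (3, 6)

With the offsets built once at fit time, each query pays only an O(1) lookup to restrict every subsequent searchsorted call to its prefix's block.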
On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com> wrote:

What makes you think this is the main bottleneck? While it is not an insignificant consumer of time, I really doubt this is what's making scikit-learn's LSH implementation severely underperform with respect to other implementations.

We need to profile. In order to do that, we need some sensible parameters that users might actually want, e.g. the number of features for the {dense, sparse} cases, the index size, and a target 10-NN precision and recall (selecting corresponding n_estimators and n_candidates). Ideally we'd consider real-world datasets. And of course, these should be sensible for whichever metric we're operating over, and for whether we're doing k-NN or radius searches.

I don't know if it's realistic, but I've profiled the following bench_plot_approximate_neighbors settings:

    Building NearestNeighbors for 100000 samples in 100 dimensions
    LSHF parameters: n_estimators = 15, n_candidates = 100
    Building LSHForest for 100000 samples in 100 dimensions
    Done in 1.492s
    Average time for lshf neighbor queries: 0.005s
    Average time for exact neighbor queries: 0.002s
    Average Accuracy : 0.88
    Speed up: 0.5x

Of the 4.77s total spent in LSHForest.kneighbors for a 1000-query matrix, we have:

- 0.03s spent in _query (hashing and descending);
- 0.91s spent in _compute_distances (exact distance calculation);
- 3.80s remaining in _get_candidates (the ascending phase), almost all of which is spent in _find_matching_indices.

Even cutting the exact distance calculation to 0s would still not make this faster than the exact approach. Of course, your mileage may vary, but this suggests to me you're barking up the wrong tree (no pun intended).
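For anyone wanting to reproduce this kind of breakdown, here is a sketch of the profiling setup using cProfile; the random data are purely illustrative, standing in for the bench_plot_approximate_neighbors dataset:

    import cProfile
    import pstats
    import numpy as np
    from sklearn.neighbors import LSHForest

    rng = np.random.RandomState(42)
    X = rng.randn(100000, 100)        # 100000 samples in 100 dimensions
    queries = rng.randn(1000, 100)    # 1000-query matrix

    lshf = LSHForest(n_estimators=15, n_candidates=100,
                     random_state=42).fit(X)

    profiler = cProfile.Profile()
    profiler.enable()
    lshf.kneighbors(queries, n_neighbors=10)
    profiler.disable()

    # Sort by cumulative time to see how kneighbors splits across
    # _query, _compute_distances and _get_candidates/_find_matching_indices.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)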
On 30 July 2015 at 19:43, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:

Hi,

I've started to look into improving the performance of LSHForest. As we discussed some time ago (quite a long time, in fact), the main concern is to Cythonize the distance calculations. Currently, this is done by iterating over all the query vectors in Python when the `kneighbors` method is called for a set of queries, and that per-query Python loop has been identified as a huge overhead. I have implemented a few Cython hacks to demonstrate the distance calculation in LSHForest, and I was able to get roughly a 10x speedup over the current distance calculation with a Python loop. However, I came across some blockers while trying to do this and need some clarifications.

What I need to know is: do we have a mechanism to release the GIL when we want to parallelize? One of my observations is that `pairwise_distances` uses all the cores even when I don't specify the `n_jobs` parameter, which defaults to 1. Is this expected behavior?

If I want to release the GIL, can I use the OpenMP module in Cython? Or is that a task for joblib?

Any input on this is highly appreciated.

Best regards,
--
Maheshakya Wijewardena
Undergraduate, Department of Computer Science and Engineering,
Faculty of Engineering, University of Moratuwa, Sri Lanka
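On the OpenMP question, for reference: Cython can release the GIL and parallelize a loop with cython.parallel.prange. A minimal sketch of a GIL-free distance kernel in a standalone .pyx file (illustrative only, not scikit-learn's actual code; the extension must be compiled and linked with -fopenmp):

    from cython.parallel import prange

    cdef double _row_sqdist(double[:, ::1] X, double[::1] q,
                            Py_ssize_t i) nogil:
        # Squared Euclidean distance from row i of X to q.
        cdef Py_ssize_t j
        cdef double acc = 0.0, diff
        for j in range(X.shape[1]):
            diff = X[i, j] - q[j]
            acc += diff * diff
        return acc

    def squared_euclidean(double[:, ::1] X, double[::1] q, double[::1] out):
        cdef Py_ssize_t i
        # prange releases the GIL and spreads rows across OpenMP threads.
        for i in prange(X.shape[0], nogil=True):
            out[i] = _row_sqdist(X, q, i)

As for pairwise_distances using all cores with n_jobs=1: for the euclidean metric it reduces to np.dot, so a multithreaded BLAS (OpenBLAS, MKL) will use several cores regardless of n_jobs; that is most likely what was observed.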
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general