I agree with Joel. Profiling indicated that 69.8% of the total kneighbors time is spent in _find_matching_indices and 22.9% in _compute_distances, so I'll give priority to working on _find_matching_indices with the method you suggested.
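(For anyone who wants to reproduce this kind of breakdown, a minimal cProfile sketch follows; the data sizes and parameters here are illustrative only, not the exact settings behind the numbers above.)

import cProfile
import pstats

import numpy as np
from sklearn.neighbors import LSHForest  # available in scikit-learn at the time of this thread

# Illustrative sizes only; the quoted figures came from a larger run
# (100000 samples, 100 features, 1000 queries).
rng = np.random.RandomState(42)
X = rng.rand(10000, 100)
queries = rng.rand(100, 100)

lshf = LSHForest(n_estimators=15, n_candidates=100, random_state=42).fit(X)

profiler = cProfile.Profile()
profiler.enable()
lshf.kneighbors(queries, n_neighbors=10)
profiler.disable()

# Shows how time splits across _find_matching_indices, _compute_distances, etc.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)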
On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:

> Hi Joel,
>
> I was on vacation during the past 3 days. I'll look into this asap and let you all know.
>
> I also did some profiling, but only of the `pairwise_distances` method. The brute force technique uses it directly on the entire query array, whereas LSH calls it in a loop, and I noticed a huge lag there. I'll first confirm your claims. I can start working on this, but I think I'll need reviewing from you or some other contributors as well. I'll do this if it's possible.
>
> On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> @Maheshakya, will you be able to do work in the near future on speeding up the ascending phase instead? Or should another contributor take that baton? Not only does it seem to be a major contributor to runtime, but it is independent of metric and hashing mechanism (within binary hashes), and hence the most fundamental component of LSHForest.
>>
>> On 30 July 2015 at 22:28, Joel Nothman <joel.noth...@gmail.com> wrote:
>>
>>> (Sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets.)
>>>
>>> On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>
>>>> One approach to fixing the ascending phase would be to ensure that _find_matching_indices only searches over parts of the tree that have not yet been explored; currently it searches over the entire index at each depth.
>>>>
>>>> My preferred, but more experimental, solution is to memoize where the first 2**b layers of the tree begin and end in the index, for small b. So if our index stored:
>>>>
>>>> [[0, 0, 0, 1, 1, 0, 0, 0],
>>>>  [0, 0, 1, 0, 1, 0, 1, 0],
>>>>  [0, 0, 1, 0, 1, 0, 1, 0],
>>>>  [0, 1, 0, 0, 0, 0, 0, 0],
>>>>  [0, 1, 0, 1, 1, 0, 0, 0],
>>>>  [0, 1, 1, 0, 0, 1, 1, 0],
>>>>  [1, 0, 0, 0, 0, 1, 0, 1],
>>>>  [1, 0, 0, 1, 0, 1, 0, 1],
>>>>  [1, 1, 0, 0, 0, 0, 0, 0],
>>>>  [1, 1, 1, 1, 1, 0, 0, 0]]
>>>>
>>>> and b=2, we'd memoize offsets for prefixes of size 2:
>>>>
>>>> [0,  # 00
>>>>  3,  # 01
>>>>  6,  # 10
>>>>  8]  # 11
>>>>
>>>> Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to leave the first b bits [0, 1] remaining, and look them up in the array just defined to identify a much narrower search space, [3, 6), matching that prefix in the overall index.
>>>>
>>>> Indeed, given the min_hash_match constraint, not having this sort of thing for b >= min_hash_match seems wasteful.
>>>>
>>>> This provides us O(1) access to the top layers of the tree when ascending, and makes the searchsorted calls run in log(n / (2**b)) time rather than log(n). It is also much more like traditional LSH. However, it complicates the code, as we now have to consider two strategies for descent/ascent.
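To make the proposal above concrete, here is a rough NumPy sketch of the prefix-offset memoization; memoize_prefix_offsets and prefix_range are illustrative names, not existing LSHForest internals. It reproduces the [0, 3, 6, 8] offsets and the [3, 6) range for the example query.

import numpy as np

# `index` is assumed to be the lexicographically sorted array of binary
# hashes from the example above.
index = np.array([[0, 0, 0, 1, 1, 0, 0, 0],
                  [0, 0, 1, 0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0, 1, 0],
                  [0, 1, 0, 0, 0, 0, 0, 0],
                  [0, 1, 0, 1, 1, 0, 0, 0],
                  [0, 1, 1, 0, 0, 1, 1, 0],
                  [1, 0, 0, 0, 0, 1, 0, 1],
                  [1, 0, 0, 1, 0, 1, 0, 1],
                  [1, 1, 0, 0, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1, 0, 0, 0]])

def memoize_prefix_offsets(index, b):
    # Integer value of the first b bits of every hash; the index is sorted,
    # so searchsorted gives the position where each prefix block starts.
    prefixes = index[:, :b].dot(1 << np.arange(b - 1, -1, -1))
    return np.searchsorted(prefixes, np.arange(2 ** b), side='left')

def prefix_range(query, b, offsets, n_samples):
    # Narrow the search space to the block sharing the query's first b bits.
    p = query[:b].dot(1 << np.arange(b - 1, -1, -1))
    stop = offsets[p + 1] if p + 1 < len(offsets) else n_samples
    return int(offsets[p]), int(stop)

offsets = memoize_prefix_offsets(index, b=2)
print(offsets)                                      # [0 3 6 8] -> blocks 00, 01, 10, 11
query = np.array([0, 1, 1, 0, 0, 0, 0, 0])
print(prefix_range(query, 2, offsets, len(index)))  # (3, 6)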
>>>> On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>
>>>>> What makes you think this is the main bottleneck? While it is not an insignificant consumer of time, I really doubt this is what's making scikit-learn's LSH implementation severely underperform with respect to other implementations.
>>>>>
>>>>> We need to profile. In order to do that, we need some sensible parameters that users might actually want, e.g. number of features for the {dense, sparse} cases, index size, and target 10NN precision and recall (selecting corresponding n_estimators and n_candidates). Ideally we'd consider real-world datasets. And of course, these should be sensible for whichever metric we're operating over, and for whether we're doing kNN or radius searches.
>>>>>
>>>>> I don't know if it's realistic, but I've profiled the following bench_plot_approximate_neighbors settings:
>>>>>
>>>>> Building NearestNeighbors for 100000 samples in 100 dimensions
>>>>> LSHF parameters: n_estimators = 15, n_candidates = 100
>>>>> Building LSHForest for 100000 samples in 100 dimensions
>>>>> Done in 1.492s
>>>>> Average time for lshf neighbor queries: 0.005s
>>>>> Average time for exact neighbor queries: 0.002s
>>>>> Average Accuracy : 0.88
>>>>> Speed up: 0.5x
>>>>>
>>>>> Of the 4.77s total time spent in LSHForest.kneighbors for a 1000-query matrix, we have:
>>>>>
>>>>> - 0.03s spent in _query (hashing and descending)
>>>>> - 0.91s spent in _compute_distances (exact distance calculation)
>>>>> - 3.80s remaining in _get_candidates (ascending phase), almost all of which is spent in _find_matching_indices
>>>>>
>>>>> Cutting the exact distance calculation to zero would still not make this faster than the exact approach. Of course, your mileage may vary, but this suggests to me you're barking up the wrong tree (no pun intended).
>>>>>
>>>>> On 30 July 2015 at 19:43, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've started to look into improving the performance of LSHForest. As we discussed some time ago (quite a long time ago, in fact), the main concern is to Cythonize the distance calculations. Currently, this is done by iterating over all the query vectors in a Python loop when the `kneighbors` method is called for a set of query vectors, and that per-query Python loop has been identified as a huge overhead. I have implemented a few Cython hacks to demonstrate the distance calculation in LSHForest, and I was able to get roughly a 10x speedup over the current distance calculation with a Python loop. However, I came across some blockers while trying to do this and need some clarifications.
>>>>>>
>>>>>> What I need to know is: do we use a mechanism to release the GIL when we want to parallelize? One of my observations is that `pairwise_distances` uses all the cores even when I don't specify the `n_jobs` parameter, which defaults to 1. Is this expected behavior?
>>>>>>
>>>>>> If I want to release the GIL, can I use the OpenMP module in Cython? Or is that a task for joblib?
>>>>>>
>>>>>> Any input on this is highly appreciated.
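On the OpenMP question, the sketch below is a hypothetical Cython fragment (not scikit-learn code; the file and function names are made up, and it must be compiled with OpenMP flags) showing how a per-query distance loop can release the GIL with cython.parallel.prange:

# distance_loop.pyx -- hypothetical sketch, not part of scikit-learn.
# Build with OpenMP enabled, e.g. extra_compile_args=['-fopenmp'] and
# extra_link_args=['-fopenmp'] in setup.py.
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def squared_euclidean(double[:, ::1] queries, double[:, ::1] candidates,
                      double[:, ::1] out):
    """Fill out[i, j] (preallocated, shape (n_queries, n_candidates)) with
    the squared distance from query i to candidate j."""
    cdef Py_ssize_t n_queries = queries.shape[0]
    cdef Py_ssize_t n_candidates = candidates.shape[0]
    cdef Py_ssize_t n_features = queries.shape[1]
    cdef Py_ssize_t i, j, k
    cdef double acc, diff

    # prange releases the GIL (nogil=True) and spreads the outer loop over
    # OpenMP threads; typed memoryview access is legal without the GIL.
    for i in prange(n_queries, nogil=True, schedule='static'):
        for j in range(n_candidates):
            acc = 0.0
            for k in range(n_features):
                diff = queries[i, k] - candidates[j, k]
                acc = acc + diff * diff  # plain assignment keeps acc thread-private
            out[i, j] = acc

For what it's worth, joblib parallelizes whole Python calls across processes or threads, so it sits at a different level than a nogil inner loop like this.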
--
Maheshakya Wijewardena
Undergraduate, Department of Computer Science and Engineering
Faculty of Engineering, University of Moratuwa, Sri Lanka