(sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets)
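To make that concrete, here's a minimal NumPy sketch of the memoized offsets, using the example index and query from the message quoted below (the names index, weights and offsets are just illustrative, not the current LSHForest internals):

    import numpy as np

    # sorted binary hash index from the example below: one hash per row,
    # rows in lexicographic order
    index = np.array([[0, 0, 0, 1, 1, 0, 0, 0],
                      [0, 0, 1, 0, 1, 0, 1, 0],
                      [0, 0, 1, 0, 1, 0, 1, 0],
                      [0, 1, 0, 0, 0, 0, 0, 0],
                      [0, 1, 0, 1, 1, 0, 0, 0],
                      [0, 1, 1, 0, 0, 1, 1, 0],
                      [1, 0, 0, 0, 0, 1, 0, 1],
                      [1, 0, 0, 1, 0, 1, 0, 1],
                      [1, 1, 0, 0, 0, 0, 0, 0],
                      [1, 1, 1, 1, 1, 0, 0, 0]])

    b = 2
    weights = 2 ** np.arange(b)[::-1]    # [2, 1]: read the first b bits as an int

    # prefix value (0 .. 2**b - 1) of every row; since the index is sorted,
    # rows sharing a prefix are contiguous
    prefixes = index[:, :b].dot(weights)

    # offsets[p] = first row whose prefix is >= p, so rows with prefix p
    # occupy [offsets[p], offsets[p + 1])
    offsets = np.searchsorted(prefixes, np.arange(2 ** b + 1))
    print(offsets[:-1])                  # [0 3 6 8], as in the example

    query = np.array([0, 1, 1, 0, 0, 0, 0, 0])
    p = query[:b].dot(weights)           # prefix 01 -> 1
    lo, hi = offsets[p], offsets[p + 1]
    print(lo, hi)                        # 3 6: the narrowed range [3, 6)
    # deeper searchsorted calls during the ascent can then be restricted to
    # index[lo:hi], i.e. log(n / 2**b) work instead of log(n)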
On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com> wrote:

> One approach to fixing the ascending phase would ensure that
> _find_matching_indices is only searching over parts of the tree that
> have not yet been explored, while currently it searches over the entire
> index at each depth.
>
> My preferred, but more experimental, solution is to memoize where the
> first 2**b layers of the tree begin and end in the index, for small b.
> So if our index stored:
>
> [[0, 0, 0, 1, 1, 0, 0, 0],
>  [0, 0, 1, 0, 1, 0, 1, 0],
>  [0, 0, 1, 0, 1, 0, 1, 0],
>  [0, 1, 0, 0, 0, 0, 0, 0],
>  [0, 1, 0, 1, 1, 0, 0, 0],
>  [0, 1, 1, 0, 0, 1, 1, 0],
>  [1, 0, 0, 0, 0, 1, 0, 1],
>  [1, 0, 0, 1, 0, 1, 0, 1],
>  [1, 1, 0, 0, 0, 0, 0, 0],
>  [1, 1, 1, 1, 1, 0, 0, 0]]
>
> and b=2, we'd memoize offsets for prefixes of size 2:
>
> [0,  # 00
>  3,  # 01
>  6,  # 10
>  8]  # 11
>
> Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
> leave the first b bits [0, 1] remaining, and look them up in the array
> just defined to identify a much narrower search space [3, 6) matching
> that prefix in the overall index.
>
> Indeed, given the min_hash_match constraint, not having this sort of
> thing for b >= min_hash_match seems wasteful.
>
> This provides us O(1) access to the top layers of the tree when
> ascending, and makes the searchsorted calls run in log(n / (2 ** b))
> time rather than log(n). It is also much more like traditional LSH.
> However, it complexifies the code, as we now have to consider two
> strategies for descent/ascent.
>
> On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> What makes you think this is the main bottleneck? While it is not an
>> insignificant consumer of time, I really doubt this is what's making
>> scikit-learn's LSH implementation severely underperform with respect
>> to other implementations.
>>
>> We need to profile. In order to do that, we need some sensible
>> parameters that users might actually want, e.g. number of features for
>> {dense, sparse} cases, index size, target 10NN precision and recall
>> (selecting corresponding n_estimators and n_candidates). Ideally we'd
>> consider real-world datasets. And of course, these should be sensible
>> for whichever metric we're operating over, and whether we're doing KNN
>> or Radius searches.
>>
>> I don't know if it's realistic, but I've profiled the following
>> bench_plot_approximate_neighbors settings:
>>
>> Building NearestNeighbors for 100000 samples in 100 dimensions
>> LSHF parameters: n_estimators = 15, n_candidates = 100
>> Building LSHForest for 100000 samples in 100 dimensions
>> Done in 1.492s
>> Average time for lshf neighbor queries: 0.005s
>> Average time for exact neighbor queries: 0.002s
>> Average Accuracy : 0.88
>> Speed up: 0.5x
>>
>> Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
>> matrix, we have:
>>
>> - 0.03 spent in _query (hashing and descending)
>> - 0.91 spent in _compute_distances (exact distance calculation)
>> - 3.80 remaining in _get_candidates (ascending phase), almost all of
>>   which is spent in _find_matching_indices
>>
>> Cutting exact distance calculation to 0s will still not get this
>> faster than the exact approach. Of course, your mileage may vary, but
>> this suggests to me you're barking up the wrong tree (no pun intended).
>>
>> On 30 July 2015 at 19:43, Maheshakya Wijewardena
>> <pmaheshak...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've started to look into the matter of improving performance of
>>> LSHForest.
>>> As we have discussed some time ago (in fact, quite a long time ago),
>>> the main concern is to Cythonize distance calculations. Currently,
>>> this is done by iterating over all the query vectors when the
>>> `kneighbors` method is called for a set of query vectors. It has been
>>> identified that iterating over each query with Python loops is a huge
>>> overhead. I have implemented a few Cython hacks to demonstrate the
>>> distance calculation in LSHForest, and I was able to get an
>>> approximate 10x speedup compared to the current distance calculation
>>> with a Python loop. However, I came across some blockers while trying
>>> to do this and need some clarifications.
>>>
>>> What I need to know is: do we use a mechanism to release the GIL when
>>> we want to parallelize? One of my observations is that
>>> `pairwise_distances` uses all the cores even when I don't specify the
>>> `n_jobs` parameter, which defaults to 1. Is this an expected behavior?
>>>
>>> If I want to release the GIL, can I use the OpenMP module in Cython?
>>> Or is that a task for Joblib?
>>>
>>> Any input on this is highly appreciated.
>>>
>>> Best regards,
>>> --
>>> Maheshakya Wijewardena,
>>> Undergraduate,
>>> Department of Computer Science and Engineering,
>>> Faculty of Engineering,
>>> University of Moratuwa,
>>> Sri Lanka
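For reference, here is a rough sketch of how the timing breakdown quoted above can be reproduced with cProfile, roughly following the quoted bench_plot_approximate_neighbors settings; random dense data stands in for whatever the benchmark script actually uses:

    import cProfile
    import pstats
    import numpy as np
    from sklearn.neighbors import LSHForest

    # 100000 x 100 dense index points and a 1000-query matrix, matching the
    # quoted benchmark dimensions (synthetic data, so absolute timings and
    # accuracy will differ from the quoted run)
    rng = np.random.RandomState(42)
    X = rng.rand(100000, 100)
    queries = rng.rand(1000, 100)

    lshf = LSHForest(n_estimators=15, n_candidates=100,
                     random_state=42).fit(X)

    profiler = cProfile.Profile()
    profiler.enable()
    lshf.kneighbors(queries, n_neighbors=10)
    profiler.disable()

    # show where the time goes: _query, _compute_distances,
    # _get_candidates / _find_matching_indices, etc.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)

Sorting by cumulative time makes it easy to check whether _find_matching_indices still dominates once the distance calculation is sped up.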