Hi Joel,
I was on vacation for the past 3 days. I'll look into this ASAP and let you
all know.

I also did some profiling, but only of the `pairwise_distances`
method. The brute-force technique calls it directly on the entire query
array, but LSHForest calls it in a loop, and I noticed a huge lag there. I'll
first confirm your claims. I can start working on this, but I think I'll
need reviews from you or other contributors as well. I'll do this if
it's possible.
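To illustrate the gap I'm seeing, here is a minimal numpy-only sketch (the helper names are mine, not scikit-learn's) contrasting one batched distance computation with a per-query Python loop; both compute the same matrix, but the looped form pays Python-level overhead per query:

```python
import numpy as np

def batched_euclidean(queries, index):
    # One vectorized call: ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    sq = (queries ** 2).sum(axis=1)[:, None]
    sx = (index ** 2).sum(axis=1)[None, :]
    d2 = sq - 2.0 * queries @ index.T + sx
    return np.sqrt(np.maximum(d2, 0.0))

def looped_euclidean(queries, index):
    # One computation per query row, as in a Python loop over queries
    return np.vstack([np.sqrt(((index - q) ** 2).sum(axis=1))
                      for q in queries])

rng = np.random.RandomState(0)
Q = rng.rand(50, 8)    # query matrix
X = rng.rand(200, 8)   # indexed points

# Identical results; the batched form is the one to prefer.
assert np.allclose(batched_euclidean(Q, X), looped_euclidean(Q, X))
```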

On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman <joel.noth...@gmail.com> wrote:

> @Maheshakya, will you be able to do work in the near future on speeding up
> the ascending phase instead? Or should another contributor take that baton?
> Not only does it seem to be a major contributor to runtime, but it is
> independent of metric and hashing mechanism (within binary hashes), and
> hence the most fundamental component of LSHForest.
>
> On 30 July 2015 at 22:28, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> (sorry, I should have said the first b layers, not 2**b layers, producing
>> a memoization of 2**b offsets.)
>>
>> On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com> wrote:
>>
>>> One approach to fixing the ascending phase would be to ensure that
>>> _find_matching_indices searches only over parts of the tree that have
>>> not yet been explored; currently it searches over the entire index at
>>> each depth.
>>>
>>> My preferred, but more experimental, solution is to memoize where the
>>> first 2**b layers of the tree begin and end in the index, for small b. So
>>> if our index stored:
>>> [[0, 0, 0, 1, 1, 0, 0, 0],
>>>  [0, 0, 1, 0, 1, 0, 1, 0],
>>>  [0, 0, 1, 0, 1, 0, 1, 0],
>>>  [0, 1, 0, 0, 0, 0, 0, 0],
>>>  [0, 1, 0, 1, 1, 0, 0, 0],
>>>  [0, 1, 1, 0, 0, 1, 1, 0],
>>>  [1, 0, 0, 0, 0, 1, 0, 1],
>>>  [1, 0, 0, 1, 0, 1, 0, 1],
>>>  [1, 1, 0, 0, 0, 0, 0, 0],
>>>  [1, 1, 1, 1, 1, 0, 0, 0]]
>>> and b=2, we'd memoize offsets for prefixes of size 2:
>>> [0, # 00
>>>  3, # 01
>>>  6, # 10
>>>  8, # 11
>>> ]
>>>
>>> Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
>>> leave the first b bits [0, 1] remaining, and look them up in the array just
>>> defined to identify a much narrower search space [3, 6) matching that
>>> prefix in the overall index.
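The lookup described above can be sketched in a few lines of numpy (illustrative only; these names are not part of LSHForest):

```python
import numpy as np

# The sorted index from the example above.
index = np.array([[0, 0, 0, 1, 1, 0, 0, 0],
                  [0, 0, 1, 0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0, 1, 0],
                  [0, 1, 0, 0, 0, 0, 0, 0],
                  [0, 1, 0, 1, 1, 0, 0, 0],
                  [0, 1, 1, 0, 0, 1, 1, 0],
                  [1, 0, 0, 0, 0, 1, 0, 1],
                  [1, 0, 0, 1, 0, 1, 0, 1],
                  [1, 1, 0, 0, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1, 0, 0, 0]])
b = 2
n_bits = index.shape[1]
weights = 1 << np.arange(n_bits - 1, -1, -1)

# Because the index is sorted, the b-bit prefixes are nondecreasing,
# so searchsorted finds where each prefix's block begins.
prefixes = (index @ weights) >> (n_bits - b)
offsets = np.searchsorted(prefixes, np.arange(2 ** b), side='left')
# offsets -> [0, 3, 6, 8], as in the memo above

query = np.array([0, 1, 1, 0, 0, 0, 0, 0])
qprefix = (query @ weights) >> (n_bits - b)   # top b bits of the query
start = offsets[qprefix]
stop = np.searchsorted(prefixes, qprefix, side='right')
# narrowed search space -> [3, 6)
```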
>>>
>>> Indeed, given the min_hash_match constraint, not having this sort of
>>> thing for b >= min_hash_match seems wasteful.
>>>
>>> This provides us O(1) access to the top layers of the tree when
>>> ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
>>> rather than log(n). It is also much more like traditional LSH. However, it
>>> complicates the code, as we now have to consider two strategies for
>>> descent/ascent.
>>>
>>>
>>>
>>> On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>
>>>> What makes you think this is the main bottleneck? While it is not an
>>>> insignificant consumer of time, I really doubt this is what's making
>>>> scikit-learn's LSH implementation severely underperform with respect to
>>>> other implementations.
>>>>
>>>> We need to profile. In order to do that, we need some sensible
>>>> parameters that users might actually want, e.g. number of features for
>>>> {dense, sparse} cases, index size, target 10NN precision and recall
>>>> (selecting corresponding n_estimators and n_candidates). Ideally we'd
>>>> consider real-world datasets. And of course, these should be sensible for
>>>> whichever metric we're operating over, and whether we're doing KNN or
>>>> Radius searches.
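As a starting point, a per-function breakdown can be collected with the standard library's cProfile/pstats; the function below is only a stand-in for the real LSHForest.kneighbors call, which would be profiled the same way:

```python
import cProfile
import io
import pstats

def kneighbors_stand_in(n_queries):
    # Placeholder workload standing in for LSHForest.kneighbors
    # on a query matrix.
    total = 0
    for _ in range(n_queries):
        total += sum(j * j for j in range(100))
    return total

profiler = cProfile.Profile()
profiler.enable()
kneighbors_stand_in(1000)
profiler.disable()

# Render the per-function stats, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()
```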
>>>>
>>>> I don't know if it's realistic, but I've profiled the following
>>>> bench_plot_approximate_neighbors settings:
>>>>
>>>> Building NearestNeighbors for 100000 samples in 100 dimensions
>>>> LSHF parameters: n_estimators = 15, n_candidates = 100
>>>> Building LSHForest for 100000 samples in 100 dimensions
>>>> Done in 1.492s
>>>> Average time for lshf neighbor queries: 0.005s
>>>> Average time for exact neighbor queries: 0.002s
>>>> Average Accuracy : 0.88
>>>> Speed up: 0.5x
>>>>
>>>> Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
>>>> matrix, we have:
>>>>
>>>>    - 0.03 spent in _query (hashing and descending)
>>>>    - 0.91 spent in _compute_distances (exact distance calculation)
>>>>    - 3.80 remaining in _get_candidates (ascending phase), almost all
>>>>    of which is spent in _find_matching_indices
>>>>
>>>> Cutting exact distance calculation to 0s will still not get this faster
>>>> than the exact approach. Of course, your mileage may vary, but this
>>>> suggests to me you're barking up the wrong tree (no pun intended).
>>>>
>>>> On 30 July 2015 at 19:43, Maheshakya Wijewardena <
>>>> pmaheshak...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've started looking into improving the performance of
>>>>> LSHForest. As we discussed some time ago (in fact, quite a long time
>>>>> ago), the main concern is Cythonizing the distance calculations.
>>>>> Currently, this is done by iterating over all the query vectors when the
>>>>> `kneighbors` method is called for a set of query vectors. It has been
>>>>> identified that iterating over each query with Python loops is a huge
>>>>> overhead. I have implemented a few Cython hacks to demonstrate the
>>>>> distance calculation in LSHForest, and I was able to get an approximate
>>>>> 10x speedup compared to the current distance calculation with a Python
>>>>> loop. However, I came across some blockers while trying to do this and
>>>>> need some clarifications.
>>>>>
>>>>> What I need to know is: do we use a mechanism to release the GIL when we
>>>>> want to parallelize? One of my observations is that `pairwise_distances`
>>>>> uses all the cores even when I don't specify the `n_jobs` parameter,
>>>>> which defaults to 1. Is this expected behavior?
>>>>>
>>>>> If I want to release the GIL, can I use OpenMP in Cython? Or is
>>>>> that a task for joblib?
>>>>> Any input on this is highly appreciated.
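On the OpenMP question: my understanding (to be confirmed by reviewers) is that cython.parallel.prange releases the GIL and maps loop iterations onto OpenMP threads when the extension is compiled with OpenMP flags, whereas joblib parallelizes at the Python level with processes or threads. A hypothetical sketch, not the actual LSHForest code:

```cython
# distances.pyx -- hypothetical sketch, not scikit-learn code.
# Build with OpenMP enabled, e.g. extra_compile_args=['-fopenmp'],
# extra_link_args=['-fopenmp'] in setup.py.
import numpy as np
from cython.parallel import prange
from libc.math cimport sqrt

def euclidean_rows(double[:, :] queries, double[:, :] index):
    cdef Py_ssize_t n_q = queries.shape[0]
    cdef Py_ssize_t n_x = index.shape[0]
    cdef Py_ssize_t n_f = queries.shape[1]
    cdef Py_ssize_t i, j, k
    cdef double acc, diff
    out = np.empty((n_q, n_x))
    cdef double[:, :] out_v = out
    # prange(..., nogil=True) releases the GIL for the loop body,
    # so iterations over queries run on OpenMP threads.
    for i in prange(n_q, nogil=True):
        for j in range(n_x):
            acc = 0.0
            for k in range(n_f):
                diff = queries[i, k] - index[j, k]
                acc = acc + diff * diff
            out_v[i, j] = sqrt(acc)
    return out
```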
>>>>>
>>>>> Best regards,
>>>>> --
>>>>>
>>>>> *Maheshakya Wijewardena, Undergraduate,*
>>>>> *Department of Computer Science and Engineering,*
>>>>> *Faculty of Engineering.*
>>>>> *University of Moratuwa,*
>>>>> *Sri Lanka*
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 

*Maheshakya Wijewardena, Undergraduate,*
*Department of Computer Science and Engineering,*
*Faculty of Engineering.*
*University of Moratuwa,*
*Sri Lanka*
