I did a rough implementation, setting b = min_hash_match. The results I got
from running the benchmark are pasted below. It roughly triples the speed-up
of the kneighbors function for large index sizes. However, this
implementation adds some overhead to fitting time, as there are 2**b *
n_estimators numpy searchsorted calls during training. That cost will most
probably be amortized as the number of queries grows, since 2**b *
n_estimators is a constant independent of the index size.
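
Roughly, the fit-time precomputation looks like the following sketch (not
the exact patch; `sorted_hashes` stands for one estimator's sorted array of
32-bit hash values, and my rough version issued one searchsorted call per
prefix, hence the 2**b * n_estimators count above):

import numpy as np

def memoize_prefix_offsets(sorted_hashes, b, hash_bits=32):
    # For each of the 2**b possible b-bit prefixes, record where hashes
    # carrying that prefix begin in the sorted array. Prefix p covers the
    # value range [p << (hash_bits - b), (p + 1) << (hash_bits - b)).
    boundaries = np.arange(2 ** b + 1, dtype=np.int64) << (hash_bits - b)
    return np.searchsorted(sorted_hashes, boundaries)

At query time, the candidates sharing the first b bits with a query hash
then sit in sorted_hashes[offsets[prefix]:offsets[prefix + 1]], where
prefix = query_hash >> (hash_bits - b).
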
I'll send a PR with proper refactoring.
On Sun, Aug 2, 2015 at 6:41 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
> Thanks, I look forward to this being improved, while I have little
> availability to help myself atm.
>
> On 2 August 2015 at 22:58, Maheshakya Wijewardena <pmaheshak...@gmail.com>
> wrote:
>
>> I agree with Joel. Profiling indicated that 69.8% of the total time of
>> kneighbors is spent in _find_matching_indices and 22.9% in
>> _compute_distances, so I'll give priority to working on
>> _find_matching_indices with the method you suggested.
>>
>> On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena <
>> pmaheshak...@gmail.com> wrote:
>>
>>> Hi Joel,
>>> I was on vacation for the past 3 days. I'll look into this asap and
>>> let you all know.
>>>
>>> I also did some profiling, but only of the `pairwise_distances` calls.
>>> The brute force technique applies it directly to the entire query
>>> array, whereas LSH calls it in a loop, and I noticed a huge lag there.
>>> I'll first confirm your claims. I can start working on this, but I
>>> think I'll need reviews from you or some other contributors as well.
>>> I'll do this if that's possible.
>>>
>>> On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman <joel.noth...@gmail.com>
>>> wrote:
>>>
>>>> @Maheshakya, will you be able to do work in the near future on speeding
>>>> up the ascending phase instead? Or should another contributor take that
>>>> baton? Not only does it seem to be a major contributor to runtime, but it
>>>> is independent of metric and hashing mechanism (within binary hashes), and
>>>> hence is the most fundamental component of LSHForest.
>>>>
>>>> On 30 July 2015 at 22:28, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>
>>>>> (Sorry, I should have said the first b layers, not 2**b layers,
>>>>> producing a memoization of 2**b offsets.)
>>>>>
>>>>> On 30 July 2015 at 22:22, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>
>>>>>> One approach to fixing the ascending phase would ensure that
>>>>>> _find_matching_indices only searches over the parts of the tree that
>>>>>> have not yet been explored, whereas currently it searches over the
>>>>>> entire index at each depth.
>>>>>>
>>>>>> My preferred, but more experimental, solution is to memoize where the
>>>>>> first 2**b layers of the tree begin and end in the index, for small b. So
>>>>>> if our index stored:
>>>>>> [[0, 0, 0, 1, 1, 0, 0, 0],
>>>>>> [0, 0, 1, 0, 1, 0, 1, 0],
>>>>>> [0, 0, 1, 0, 1, 0, 1, 0],
>>>>>> [0, 1, 0, 0, 0, 0, 0, 0],
>>>>>> [0, 1, 0, 1, 1, 0, 0, 0],
>>>>>> [0, 1, 1, 0, 0, 1, 1, 0],
>>>>>> [1, 0, 0, 0, 0, 1, 0, 1],
>>>>>> [1, 0, 0, 1, 0, 1, 0, 1],
>>>>>> [1, 1, 0, 0, 0, 0, 0, 0],
>>>>>> [1, 1, 1, 1, 1, 0, 0, 0]]
>>>>>> and b=2, we'd memoize offsets for prefixes of size 2:
>>>>>> [0, # 00
>>>>>> 3, # 01
>>>>>> 6, # 10
>>>>>> 8, # 11
>>>>>> ]
>>>>>>
>>>>>> Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift it
>>>>>> down so that only the first b bits [0, 1] remain, and look them up
>>>>>> in the array just defined to identify a much narrower search space,
>>>>>> [3, 6), matching that prefix in the overall index.
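>>>>>>
>>>>>> In numpy terms, that lookup might go something like this (a sketch
>>>>>> with made-up names; `tree` is the sorted 8-bit hash array above and
>>>>>> `offsets` the memoized boundaries [0, 3, 6, 8] plus an end sentinel
>>>>>> 10):
>>>>>>
>>>>>> import numpy as np
>>>>>>
>>>>>> def match_range(tree, offsets, query_hash, depth, hash_bits=8, b=2):
>>>>>>     # Jump straight to the block sharing the first b bits.
>>>>>>     prefix = query_hash >> (hash_bits - b)
>>>>>>     start, stop = offsets[prefix], offsets[prefix + 1]
>>>>>>     # Hashes sharing the first `depth` bits (depth >= b) form a
>>>>>>     # contiguous run inside that block, so each searchsorted scans
>>>>>>     # n / 2**b entries instead of n.
>>>>>>     lo_val = (query_hash >> (hash_bits - depth)) << (hash_bits - depth)
>>>>>>     hi_val = lo_val + (1 << (hash_bits - depth))
>>>>>>     lo = start + np.searchsorted(tree[start:stop], lo_val)
>>>>>>     hi = start + np.searchsorted(tree[start:stop], hi_val)
>>>>>>     # tree[lo:hi] share the first `depth` bits with the query.
>>>>>>     return lo, hi
>>>>>>
>>>>>> For the query above (0b01100000), prefix is 0b01, so we land in
>>>>>> [3, 6) before a single searchsorted call.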
>>>>>>
>>>>>> Indeed, given the min_hash_match constraint, not having this sort of
>>>>>> thing for b >= min_hash_match seems wasteful.
>>>>>>
>>>>>> This gives us O(1) access to the top layers of the tree when
>>>>>> ascending, and makes the searchsorted calls run in log(n / (2 ** b))
>>>>>> time rather than log(n). It is also much more like traditional LSH.
>>>>>> However, it complicates the code, as we now have to consider two
>>>>>> strategies for descent/ascent.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 30 July 2015 at 21:46, Joel Nothman <joel.noth...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> What makes you think this is the main bottleneck? While it is not an
>>>>>>> insignificant consumer of time, I really doubt this is what's making
>>>>>>> scikit-learn's LSH implementation severely underperform with respect to
>>>>>>> other implementations.
>>>>>>>
>>>>>>> We need to profile. In order to do that, we need some sensible
>>>>>>> parameters that users might actually want, e.g. number of features
>>>>>>> for {dense, sparse} cases, index size, target 10NN precision and
>>>>>>> recall (selecting corresponding n_estimators and n_candidates).
>>>>>>> Ideally we'd consider real-world datasets. And of course, these
>>>>>>> should be sensible for whichever metric we're operating over, and
>>>>>>> for whether we're doing kNN or radius searches.
>>>>>>>
>>>>>>> I don't know if it's realistic, but I've profiled the following
>>>>>>> bench_plot_approximate_neighbors settings:
>>>>>>>
>>>>>>> Building NearestNeighbors for 100000 samples in 100 dimensions
>>>>>>> LSHF parameters: n_estimators = 15, n_candidates = 100
>>>>>>> Building LSHForest for 100000 samples in 100 dimensions
>>>>>>> Done in 1.492s
>>>>>>> Average time for lshf neighbor queries: 0.005s
>>>>>>> Average time for exact neighbor queries: 0.002s
>>>>>>> Average Accuracy : 0.88
>>>>>>> Speed up: 0.5x
>>>>>>>
>>>>>>> Of the 4.77s total time spent in LSHForest.kneighbors for a
>>>>>>> 1000-query matrix, we have:
>>>>>>>
>>>>>>> - 0.03s spent in _query (hashing and descending)
>>>>>>> - 0.91s spent in _compute_distances (exact distance calculation)
>>>>>>> - 3.80s remaining in _get_candidates (ascending phase), almost all
>>>>>>> of which is spent in _find_matching_indices
>>>>>>>
>>>>>>> Cutting the exact distance calculation to 0s would still not make
>>>>>>> this faster than the exact approach. Of course, your mileage may
>>>>>>> vary, but this suggests to me you're barking up the wrong tree (no
>>>>>>> pun intended).
>>>>>>>
>>>>>>> On 30 July 2015 at 19:43, Maheshakya Wijewardena <
>>>>>>> pmaheshak...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've started to look into improving the performance of LSHForest.
>>>>>>>> As we discussed some time ago (quite a long time, in fact), the
>>>>>>>> main concern is to Cythonize the distance calculations. Currently,
>>>>>>>> this is done by iterating over all the query vectors when the
>>>>>>>> `kneighbors` method is called for a set of query vectors. It has
>>>>>>>> been identified that iterating over each query with Python loops
>>>>>>>> is a huge overhead. I have implemented a few Cython hacks to
>>>>>>>> demonstrate the distance calculation in LSHForest, and I was able
>>>>>>>> to get an approximate 10x speedup compared to the current distance
>>>>>>>> calculation with a Python loop. However, I came across some
>>>>>>>> blockers while trying to do this and need some clarifications.
>>>>>>>>
>>>>>>>> What I need to know is: do we have a mechanism to release the GIL
>>>>>>>> when we want to parallelize? One of my observations is that
>>>>>>>> `pairwise_distances` uses all the cores even when I don't specify
>>>>>>>> the `n_jobs` parameter, which defaults to 1. Is this expected
>>>>>>>> behavior?
>>>>>>>>
>>>>>>>> If I want to release the GIL, can I use the OpenMP module in
>>>>>>>> Cython, along the lines of the sketch below? Or is that a task for
>>>>>>>> joblib?
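>>>>>>>>
>>>>>>>> For concreteness, the kind of kernel I hacked together is roughly
>>>>>>>> along these lines (a sketch only, not my actual code; it assumes
>>>>>>>> dense float64 arrays and compiling the extension with OpenMP
>>>>>>>> flags):
>>>>>>>>
>>>>>>>> # cython: boundscheck=False, wraparound=False
>>>>>>>> # Sketch of a Cythonized distance kernel, not the actual patch.
>>>>>>>> from cython.parallel import prange
>>>>>>>>
>>>>>>>> def squared_distances(double[:, ::1] queries,
>>>>>>>>                       double[:, ::1] candidates,
>>>>>>>>                       double[:, ::1] out):
>>>>>>>>     cdef Py_ssize_t i, j, k
>>>>>>>>     cdef double acc, diff
>>>>>>>>     # prange releases the GIL and distributes queries over OpenMP
>>>>>>>>     # threads; the inner loops run without Python overhead.
>>>>>>>>     for i in prange(queries.shape[0], nogil=True):
>>>>>>>>         for j in range(candidates.shape[0]):
>>>>>>>>             acc = 0
>>>>>>>>             for k in range(queries.shape[1]):
>>>>>>>>                 diff = queries[i, k] - candidates[j, k]
>>>>>>>>                 acc = acc + diff * diff
>>>>>>>>             out[i, j] = acc
>>>>>>>>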
>>>>>>>> Any input on this is highly appreciated.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> --
>>>>>>>>
>>>>>>>> Maheshakya Wijewardena, Undergraduate,
>>>>>>>> Department of Computer Science and Engineering,
>>>>>>>> Faculty of Engineering,
>>>>>>>> University of Moratuwa,
>>>>>>>> Sri Lanka
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
--
Maheshakya Wijewardena, Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering,
University of Moratuwa,
Sri Lanka
==========================================================
Sample size: 1000
------------------------
Building NearestNeighbors for 1000 samples in 100 dimensions
LSHF parameters: n_estimators = 3, n_candidates = 50
Building LSHForest for 1000 samples in 100 dimensions
(3, 32)
Done in 0.009s
Average time for lshf neighbor queries: 0.002s
Average time for exact neighbor queries: 0.000s
Average Accuracy : 0.80
Speed up: 0.1x
LSHF parameters: n_estimators = 5, n_candidates = 70
Building LSHForest for 1000 samples in 100 dimensions
(5, 32)
Done in 0.026s
Average time for lshf neighbor queries: 0.002s
Average time for exact neighbor queries: 0.000s
Average Accuracy : 0.89
Speed up: 0.1x
LSHF parameters: n_estimators = 10, n_candidates = 100
Building LSHForest for 1000 samples in 100 dimensions
(10, 32)
Done in 0.031s
Average time for lshf neighbor queries: 0.005s
Average time for exact neighbor queries: 0.000s
Average Accuracy : 0.93
Speed up: 0.0x
==========================================================
==========================================================
Sample size: 10000
------------------------
Building NearestNeighbors for 10000 samples in 100 dimensions
LSHF parameters: n_estimators = 3, n_candidates = 50
Building LSHForest for 10000 samples in 100 dimensions
(3, 32)
Done in 0.057s
Average time for lshf neighbor queries: 0.001s
Average time for exact neighbor queries: 0.000s
Average Accuracy : 0.52
Speed up: 0.4x
LSHF parameters: n_estimators = 5, n_candidates = 70
Building LSHForest for 10000 samples in 100 dimensions
(5, 32)
Done in 0.097s
Average time for lshf neighbor queries: 0.001s
Average time for exact neighbor queries: 0.000s
Average Accuracy : 0.74
Speed up: 0.2x
LSHF parameters: n_estimators = 10, n_candidates = 100
Building LSHForest for 10000 samples in 100 dimensions
(10, 32)
Done in 0.192s
Average time for lshf neighbor queries: 0.003s
Average time for exact neighbor queries: 0.000s
Average Accuracy : 0.92
Speed up: 0.1x
==========================================================
==========================================================
Sample size: 100000
------------------------
Building NearestNeighbors for 100000 samples in 100 dimensions
LSHF parameters: n_estimators = 3, n_candidates = 50
Building LSHForest for 100000 samples in 100 dimensions
(3, 32)
Done in 0.586s
Average time for lshf neighbor queries: 0.001s
Average time for exact neighbor queries: 0.003s
Average Accuracy : 0.36
Speed up: 2.5x
LSHF parameters: n_estimators = 5, n_candidates = 70
Building LSHForest for 100000 samples in 100 dimensions
(5, 32)
Done in 0.993s
Average time for lshf neighbor queries: 0.001s
Average time for exact neighbor queries: 0.003s
Average Accuracy : 0.50
Speed up: 2.4x
LSHF parameters: n_estimators = 10, n_candidates = 100
Building LSHForest for 100000 samples in 100 dimensions
(10, 32)
Done in 1.815s
Average time for lshf neighbor queries: 0.002s
Average time for exact neighbor queries: 0.003s
Average Accuracy : 0.74
Speed up: 1.5x
==========================================================
==========================================================
Sample size: 1000000
------------------------
Building NearestNeighbors for 1000000 samples in 100 dimensions
LSHF parameters: n_estimators = 3, n_candidates = 50
Building LSHForest for 1000000 samples in 100 dimensions
(3, 32)
Done in 5.077s
Average time for lshf neighbor queries: 0.005s
Average time for exact neighbor queries: 0.032s
Average Accuracy : 0.39
Speed up: 6.2x
LSHF parameters: n_estimators = 5, n_candidates = 70
Building LSHForest for 1000000 samples in 100 dimensions
(5, 32)
Done in 8.371s
Average time for lshf neighbor queries: 0.007s
Average time for exact neighbor queries: 0.032s
Average Accuracy : 0.54
Speed up: 4.3x
LSHF parameters: n_estimators = 10, n_candidates = 100
Building LSHForest for 1000000 samples in 100 dimensions
(10, 32)
Done in 16.895s
Average time for lshf neighbor queries: 0.012s
Average time for exact neighbor queries: 0.032s
Average Accuracy : 0.74
Speed up: 2.7x
==========================================================
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general