What makes you think this is the main bottleneck? While it is not an
insignificant consumer of time, I really doubt this is what's making
scikit-learn's LSH implementation severely underperform with respect to
other implementations.

We need to profile. In order to do that, we need some sensible parameters
that users might actually want, e.g. the number of features for the {dense,
sparse} cases, index size, and a target 10NN precision and recall (selecting
the corresponding n_estimators and n_candidates). Ideally we'd consider
real-world datasets. And of course, these should be sensible for whichever
metric we're operating over, and for whether we're doing kNN or radius
searches.
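
For concreteness, a minimal cProfile harness along these lines does the
job (the sizes and parameters below are illustrative, chosen to match the
benchmark settings rather than as a recommendation):

    import cProfile
    import numpy as np
    from sklearn.neighbors import LSHForest

    rng = np.random.RandomState(42)
    X_index = rng.rand(100000, 100)  # indexed points
    X_query = rng.rand(1000, 100)    # query matrix

    lshf = LSHForest(n_estimators=15, n_candidates=100)
    lshf.fit(X_index)

    # sort by cumulative time to see where kneighbors spends it
    cProfile.run('lshf.kneighbors(X_query, n_neighbors=10)',
                 sort='cumtime')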

I don't know if it's realistic, but I've profiled the following
bench_plot_approximate_neighbors settings:

Building NearestNeighbors for 100000 samples in 100 dimensions
LSHF parameters: n_estimators = 15, n_candidates = 100
Building LSHForest for 100000 samples in 100 dimensions
Done in 1.492s
Average time for lshf neighbor queries: 0.005s
Average time for exact neighbor queries: 0.002s
Average Accuracy : 0.88
Speed up: 0.5x

Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query matrix,
we have:

   - 0.03s spent in _query (hashing and descending)
   - 0.91s spent in _compute_distances (exact distance calculation)
   - 3.80s remaining in _get_candidates (ascending phase), almost all of
   which is spent in _find_matching_indices (see the sketch below)
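
For context, the ascending phase has to locate, in each tree's sorted hash
array, the contiguous block of entries that share a prefix with the query
hash. A rough sketch of that kind of prefix search (my own simplification
with integer hashes assumed, not the actual scikit-learn code) is:

    import numpy as np

    def find_matching_indices(sorted_hashes, query_hash, prefix_bits,
                              hash_bits=32):
        # Entries whose leading `prefix_bits` bits match the query's
        # form a contiguous block in the sorted array, found with two
        # binary searches (assumes uint32 hashes and
        # 1 <= prefix_bits <= hash_bits).
        mask = ~np.uint32(0) << np.uint32(hash_bits - prefix_bits)
        lo = np.searchsorted(sorted_hashes, query_hash & mask,
                             side='left')
        hi = np.searchsorted(sorted_hashes,
                             (query_hash & mask) | ~mask,
                             side='right')
        return np.arange(lo, hi)

Each searchsorted call is cheap; the 3.80s comes from issuing them from
Python once per query, per tree and per prefix length.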

Even cutting the exact distance calculation to 0s would still not make this
faster than the exact approach. Of course, your mileage may vary, but this
suggests to me that you're barking up the wrong tree (no pun intended).
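
To put the distance step in perspective: the calculation being Cythonized
below already reduces, per query, to a single vectorized call of roughly
this shape (variable names here are hypothetical; LSHForest works with
cosine distances):

    from sklearn.metrics.pairwise import pairwise_distances

    # `candidates` is the candidate index array returned by the
    # ascending phase for one query (hypothetical names)
    distances = pairwise_distances(query.reshape(1, -1),
                                   X_index[candidates],
                                   metric='cosine')

Even a 10x speedup on that step saves at most ~0.8s of the 4.77s total,
while _find_matching_indices still accounts for 3.80s.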

On 30 July 2015 at 19:43, Maheshakya Wijewardena <pmaheshak...@gmail.com>
wrote:

> Hi,
>
> I've started to look into the matter of improving the performance of
> LSHForest. As we discussed some time ago (quite a long time ago, in fact),
> the main concern is to Cythonize the distance calculations. Currently, this
> is done by iterating over all the query vectors when the `kneighbors`
> method is called for a set of query vectors. It has been identified that
> iterating over each query with Python loops is a huge overhead. I have
> implemented a few Cython hacks to demonstrate the distance calculation in
> LSHForest, and I was able to get an approximate 10x speedup over the
> current distance calculation with a Python loop. However, I came across
> some blockers while trying to do this and need some clarifications.
>
> What I need to know is: do we use a mechanism to release the GIL when we
> want to parallelize? One of my observations is that `pairwise_distances`
> uses all the cores even when I don't specify the `n_jobs` parameter, which
> defaults to 1. Is this expected behavior?
>
> If I want to release the GIL, can I use the OpenMP module in Cython? Or is
> that a task for joblib?
> Any input on this is highly appreciated.
>
> Best regards,
> --
>
> Maheshakya Wijewardena,
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering,
> University of Moratuwa,
> Sri Lanka