2014/1/28 Mathieu Blondel <math...@mblondel.org>:
>
>
>
> On Tue, Jan 28, 2014 at 9:25 PM, Olivier Grisel <olivier.gri...@ensta.org>
> wrote:
>>
>> While vanilla LSH is an interesting baseline for Approximate Nearest
>> Neighbors search, it is often too error-prone to be practically
>> useful. There exist alternative data-driven ANN methods that can have
>> much lower error rates (depending on the data). Among the top
>> implementations there are FLANN [1] and Spotify's Annoy [2]. Both are
>> written in C++ with Python bindings: there is a bench here by Radim:
>>
>>
>> http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/
>>
>> It would be interesting to implement the baseline vanilla LSH (either
>> with random projections or min-hash) and / or Cython versions of the
>> random projection forests from Annoy and / or the hierarchical k-means
>> trees and randomized K-D trees from FLANN.
>
>
> As always, I think the rule of thumb for inclusion in scikit-learn should be
> that the algorithm is standard in the field and has a fairly high citation
> count. Is this the case for the algorithms you mention? What are the 2 or 3
> most famous LSH algorithms?

The original FLANN paper is from 2009 and has 168 citations:

http://citeseer.ist.psu.edu/showciting;jsessionid=444F2769597BBE3617FC3AF812E51FAD?doi=10.1.1.160.1721

The method implemented in Annoy (a forest of rejection-sampled random
projections for the tree nodes) is an empirical trick that is
apparently common among practitioners, but I do not know the main
reference:

https://twitter.com/ogrisel/status/428137534632099840

The code looks very simple, and we could reuse our sparse random
projection matrices to save some memory and speed up the projections.
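
As a rough illustration (a sketch, not a worked-out design), sparse
random projections could double as the hashing step: the sign of each
projected coordinate gives one bit of an LSH-style code. Only
SparseRandomProjection is an actual sklearn class here; the rest is
made up for the example:

    import numpy as np
    from sklearn.random_projection import SparseRandomProjection

    rng = np.random.RandomState(42)
    data = rng.randn(1000, 100)

    # One random hyperplane per hash bit; the sparse projection keeps
    # both the memory footprint and the projection cost low.
    n_bits = 16
    srp = SparseRandomProjection(n_components=n_bits, random_state=42)
    projected = srp.fit_transform(data)

    # Points falling on the same side of all n_bits hyperplanes collide.
    codes = (projected > 0).astype(np.uint8)

    # Pack the bits into one integer key per sample for bucketing.
    keys = codes.dot(1 << np.arange(n_bits))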

>> All approaches could probably be implemented quite efficiently by
>> reusing existing functions and classes from other sklearn modules. For
>> instance building 100 FLANN-style randomized kd-trees would look very
>> similar to:
>>
>> ExtraTreesRegressor(n_estimators=100, max_features=1).fit(data,
>> np.zeros(data.shape[0]))
>>
> That sounds similar to
>
> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomTreesEmbedding.html

This is indeed the same. What is missing is storing the transformed /
indexed training data and a method to run KNN queries on the output
(e.g. using the same API as the exact nearest neighbors methods in
the sklearn.neighbors package).
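
To sketch what that could look like (the bucketing strategy and helper
names below are just my assumptions, not a settled design): index each
training point by its (tree, leaf) memberships, then at query time
re-rank the union of the query's buckets with exact distances:

    import numpy as np
    from collections import defaultdict
    from sklearn.ensemble import RandomTreesEmbedding
    from sklearn.metrics.pairwise import euclidean_distances

    rng = np.random.RandomState(0)
    X_train = rng.randn(500, 20)

    # Index phase: remember which training samples land in each leaf.
    rte = RandomTreesEmbedding(n_estimators=100, random_state=0).fit(X_train)
    leaves = rte.apply(X_train)  # shape: (n_samples, n_estimators)
    buckets = defaultdict(list)
    for i, sample_leaves in enumerate(leaves):
        for tree_id, leaf_id in enumerate(sample_leaves):
            buckets[(tree_id, leaf_id)].append(i)

    # Query phase: candidates share at least one leaf with the query;
    # exact re-ranking on the small candidate set gives the k-NN.
    def knn_query(x, k=5):
        query_leaves = rte.apply(x.reshape(1, -1))[0]
        candidates = set()
        for tree_id, leaf_id in enumerate(query_leaves):
            candidates.update(buckets[(tree_id, leaf_id)])
        candidates = np.fromiter(candidates, dtype=int)
        dists = euclidean_distances(x.reshape(1, -1), X_train[candidates])[0]
        return candidates[np.argsort(dists)[:k]]

    print(knn_query(X_train[0], k=5))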

Anyway, even if we decide not to include the most recent ANN methods
in the main sklearn repo and instead ship them in a side repo that
follows the sklearn API and coding conventions, I think this would be
an interesting topic for a GSoC.

The baseline LSH method could be implemented directly in sklearn though.
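
For instance, the classic multi-table hyperplane LSH fits in a few
dozen lines of NumPy. Again only a sketch; parameters like n_tables
and n_bits are placeholders that would need tuning:

    import numpy as np
    from collections import defaultdict

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 50)

    n_tables, n_bits = 10, 12
    powers = 1 << np.arange(n_bits)

    def hash_keys(points, planes):
        # Each bit records which side of a random hyperplane a point is on.
        return (points.dot(planes) > 0).astype(int).dot(powers)

    # Build n_tables independent hash tables to boost recall.
    all_planes = [rng.randn(X.shape[1], n_bits) for _ in range(n_tables)]
    tables = []
    for planes in all_planes:
        table = defaultdict(list)
        for i, key in enumerate(hash_keys(X, planes)):
            table[key].append(i)
        tables.append(table)

    def query(x, k=5):
        # Union of the query's buckets across tables, then exact re-ranking.
        candidates = set()
        for planes, table in zip(all_planes, tables):
            key = hash_keys(x.reshape(1, -1), planes)[0]
            candidates.update(table.get(key, ()))
        candidates = np.fromiter(candidates, dtype=int)
        dists = np.linalg.norm(X[candidates] - x, axis=1)
        return candidates[np.argsort(dists)[:k]]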

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
