While vanilla LSH is an interesting baseline for Approximate Nearest
Neighbors search, it is often too error-prone to be practically
useful. There exists alternative data-driven ANN methods that can have
a much lower error rates (depending on the data). Among the top
implementations there are FLANN [1] and Spotify's Annoy [2]. Both are
written in C++ with Python bindings: there is a bench here by Radim:

http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/

It would be interesting to implement the baseline vanilla LSH (either
with random projections or min-hash) and / or Cython versions of the
random projection forests from Annoy and / or the Hierarchical k-means
trees and randomized K-D trees from FLANN.

All approaches could be probably be implemented quite effiently by
reusing existing functions and classes from other sklearn modules. For
instance building 100 FLANN-style randomized kd-trees would look very
similar to:

ExtraTreesRegressor(n_estimators=100, max_features=1).fit(data,
np.zeros(data.shape[0]))

max_features=1 makes the extra trees fully random. Then we would use
the apply method to compute the hashing itself. Implementing a
fit_apply in the sklearn decision trees and forests would make render
initial index building even more efficient.

For more details, read the FLANN paper (very interesting) [3] and the
source code of the Annoy random projection tree building [4].

[1] https://github.com/mariusmuja/flann
[2] https://github.com/spotify/annoy
[3] http://people.cs.ubc.ca/%7Emariusm/uploads/FLANN/flann_visapp09.pdf
[4] https://github.com/spotify/annoy/blob/master/src/annoylib.cc#L354

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

------------------------------------------------------------------------------
WatchGuard Dimension instantly turns raw network data into actionable 
security intelligence. It gives you real-time visual feedback on key
security issues and trends.  Skip the complicated setup - simply import
a virtual appliance and go from zero to informed in seconds.
http://pubads.g.doubleclick.net/gampad/clk?id=123612991&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to