While vanilla LSH is an interesting baseline for Approximate Nearest Neighbors search, it is often too error-prone to be practically useful. There exists alternative data-driven ANN methods that can have a much lower error rates (depending on the data). Among the top implementations there are FLANN [1] and Spotify's Annoy [2]. Both are written in C++ with Python bindings: there is a bench here by Radim:
http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/ It would be interesting to implement the baseline vanilla LSH (either with random projections or min-hash) and / or Cython versions of the random projection forests from Annoy and / or the Hierarchical k-means trees and randomized K-D trees from FLANN. All approaches could be probably be implemented quite effiently by reusing existing functions and classes from other sklearn modules. For instance building 100 FLANN-style randomized kd-trees would look very similar to: ExtraTreesRegressor(n_estimators=100, max_features=1).fit(data, np.zeros(data.shape[0])) max_features=1 makes the extra trees fully random. Then we would use the apply method to compute the hashing itself. Implementing a fit_apply in the sklearn decision trees and forests would make render initial index building even more efficient. For more details, read the FLANN paper (very interesting) [3] and the source code of the Annoy random projection tree building [4]. [1] https://github.com/mariusmuja/flann [2] https://github.com/spotify/annoy [3] http://people.cs.ubc.ca/%7Emariusm/uploads/FLANN/flann_visapp09.pdf [4] https://github.com/spotify/annoy/blob/master/src/annoylib.cc#L354 -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel ------------------------------------------------------------------------------ WatchGuard Dimension instantly turns raw network data into actionable security intelligence. It gives you real-time visual feedback on key security issues and trends. Skip the complicated setup - simply import a virtual appliance and go from zero to informed in seconds. http://pubads.g.doubleclick.net/gampad/clk?id=123612991&iu=/4140/ostg.clktrk _______________________________________________ Scikit-learn-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
