On 28 Jan 2014, at 15:31, Olivier Grisel <olivier.gri...@ensta.org> wrote:
> 2014/1/28 Mathieu Blondel <math...@mblondel.org>:
>>
>>
>>
>> On Tue, Jan 28, 2014 at 9:25 PM, Olivier Grisel <olivier.gri...@ensta.org>
>> wrote:
>>>
>>> While vanilla LSH is an interesting baseline for Approximate Nearest
>>> Neighbors search, it is often too error-prone to be practically
>>> useful. There exists alternative data-driven ANN methods that can have
>>> a much lower error rates (depending on the data). Among the top
>>> implementations there are FLANN [1] and Spotify's Annoy [2]. Both are
>>> written in C++ with Python bindings: there is a bench here by Radim:
>>>
>>>
>>> http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/
>>>
>>> It would be interesting to implement the baseline vanilla LSH (either
>>> with random projections or min-hash) and / or Cython versions of the
>>> random projection forests from Annoy and / or the Hierarchical k-means
>>> trees and randomized K-D trees from FLANN.
>>
>>
>> As always, I think the rule of thumb for inclusion in scikit-learn should be
>> that the algorithm is standard in the field and have a fairly high citation
>> count. Is this the case for the algorithms you mention? What are the 2 or 3
>> most famous LSH algorithms?
>
> The original FLANN paper is from 2009 and has 168 citations:
>
> http://citeseer.ist.psu.edu/showciting;jsessionid=444F2769597BBE3617FC3AF812E51FAD?doi=10.1.1.160.1721
>
> The method implemented in Annoy (a forest or rejection sampled random
> projections for the tree nodes) is an empirical trick that is
> apparently common among practitioners but I do not know the main
> reference:
>
> https://twitter.com/ogrisel/status/428137534632099840
>
> The code looks very simple and we can reuse our sparse random
> projection matrices to spare some memory and speed up projections.
>
I don’t know annoy, but could it be random projections trees
as in http://cseweb.ucsd.edu/~dasgupta/papers/rptree-stoc.pdf?
Arnaud
------------------------------------------------------------------------------
WatchGuard Dimension instantly turns raw network data into actionable
security intelligence. It gives you real-time visual feedback on key
security issues and trends. Skip the complicated setup - simply import
a virtual appliance and go from zero to informed in seconds.
http://pubads.g.doubleclick.net/gampad/clk?id=123612991&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general