Re: [Scikit-learn-general] Google Summer of Code 2014

Arnaud Joly Tue, 28 Jan 2014 06:42:52 -0800

On 28 Jan 2014, at 15:31, Olivier Grisel <[email protected]> wrote:


> 2014/1/28 Mathieu Blondel <[email protected]>:
>> 
>> 
>> 
>> On Tue, Jan 28, 2014 at 9:25 PM, Olivier Grisel <[email protected]>
>> wrote:
>>> 
>>> While vanilla LSH is an interesting baseline for Approximate Nearest
>>> Neighbors search, it is often too error-prone to be practically
>>> useful. There exists alternative data-driven ANN methods that can have
>>> a much lower error rates (depending on the data). Among the top
>>> implementations there are FLANN [1] and Spotify's Annoy [2]. Both are
>>> written in C++ with Python bindings: there is a bench here by Radim:
>>> 
>>> 
>>> http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/
>>> 
>>> It would be interesting to implement the baseline vanilla LSH (either
>>> with random projections or min-hash) and / or Cython versions of the
>>> random projection forests from Annoy and / or the Hierarchical k-means
>>> trees and randomized K-D trees from FLANN.
>> 
>> 
>> As always, I think the rule of thumb for inclusion in scikit-learn should be
>> that the algorithm is standard in the field and have a fairly high citation
>> count. Is this the case for the algorithms you mention? What are the 2 or 3
>> most famous LSH algorithms?
> 
> The original FLANN paper is from 2009 and has 168 citations:
> 
> http://citeseer.ist.psu.edu/showciting;jsessionid=444F2769597BBE3617FC3AF812E51FAD?doi=10.1.1.160.1721
> 
> The method implemented in Annoy (a forest or rejection sampled random
> projections for the tree nodes) is an empirical trick that is
> apparently common among practitioners but I do not know the main
> reference:
> 
> https://twitter.com/ogrisel/status/428137534632099840
> 
> The code looks very simple and we can reuse our sparse random
> projection matrices to spare some memory and speed up projections.
> 

I don’t know annoy, but could it be random projections trees
as in http://cseweb.ucsd.edu/~dasgupta/papers/rptree-stoc.pdf?

Arnaud

------------------------------------------------------------------------------
WatchGuard Dimension instantly turns raw network data into actionable 
security intelligence. It gives you real-time visual feedback on key
security issues and trends.  Skip the complicated setup - simply import
a virtual appliance and go from zero to informed in seconds.
http://pubads.g.doubleclick.net/gampad/clk?id=123612991&iu=/4140/ostg.clktrk

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Google Summer of Code 2014

Reply via email to