2014-03-13 23:15 GMT+01:00 Robert Layton <[email protected]>:
> Thanks Gael. My thinking was to implement "Basic LSH with basic data
> structures" and then spend some of the time working on seeing if moderate
> improvements (i.e. a more complex data structure) can deliver benefits. This
> way, we get the key deliverable, and spend some time trying to see if we can
> do better.
>
> I'd also like to see scalability added to the evaluation criteria!

The problem is that by having had a look at the ANN literature in the
past, my gut feeling is that basic LSH is mostly worthless for
practical applications. I would rather not have a method is
scikit-learn that is not performing good enough (in terms of speed up
vs exact brute force queries) to be useful on real data.

To decide I think the best way to proceed would be to have a
evaluation / prototyping stage in the project that would first hack an
implementation in of the basic methods (at least random projections
and maybe others) in a gist outside of the scikit-learn codebase, not
to worry about documentation and compliance with the API and benchmark
it / profile it on representative datasets with various statistical
structures and compare the outcome to alternative methods such as the
methods implemented in FLANN and Annoy.

Actually there exists several Python implementation of vanilla LSH:

https://pypi.python.org/pypi/lshash
http://nearpy.io/
https://github.com/yahoo/Optimal-LSH/blob/master/lsh.py

It would be interesting to benchmark them all on the same datasets
(with various dimensionality, ranks and sparsity patterns).

Note, that Radim already had a look at the yahoo LSH implementation
and found it impractical to use:

http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to