Hi Olivier,
According to the FLANN documentation, among its neighbor search methods there
is an LSH-based approach that uses multi-probe LSH. However, that index type
can only be used for matching binary features with Hamming distances, so
FLANN's LSH implementation is not usable in the general setting.
According to Radim (
http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/),
Annoy's performance is much better than FLANN's. The only method used in Annoy
is random projection trees, with a small tweak in how the random hyperplanes
are generated. Its structure is very similar to the LSH-Forest implementation
(which Daniel suggested). If we apply that method of performing random
projections in LSH-Forest, it may show improvements, so I suggest testing
this approach as well during prototyping.
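To make that tweak concrete: for Annoy's 'euclidean' metric the splitting
hyperplane is, as far as I understand it, the perpendicular bisector of two
sampled data points, whereas classic random-projection LSH draws a
data-oblivious Gaussian normal. A toy sketch of the two (my own illustration,
not Annoy's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def annoy_style_split(points, rng):
    """Data-driven hyperplane: sample two points and split along the
    perpendicular bisector of the segment joining them."""
    i, j = rng.choice(len(points), size=2, replace=False)
    normal = points[i] - points[j]                  # hyperplane normal
    offset = normal @ (points[i] + points[j]) / 2   # passes through the midpoint
    side = points @ normal > offset                 # boolean split of the data
    return side, normal, offset

def lsh_style_split(points, rng):
    """Data-oblivious hyperplane through the origin with a random
    Gaussian normal, as in classic random-projection LSH."""
    normal = rng.normal(size=points.shape[1])
    return points @ normal > 0, normal

X = rng.normal(size=(100, 8))
side, _, _ = annoy_style_split(X, rng)
print(side.sum(), (~side).sum())  # sizes of the two child nodes
```

Since the bisector always separates the two sampled points, both children are
guaranteed non-empty, which is one reason the data-driven split tends to give
more balanced trees than a random hyperplane through the origin.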
A lot of the literature states that vanilla LSH is not ideal for use in
practice: it is not memory-efficient, and choosing the signature length and
the number of hash tables (k and L) is difficult.
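For reference, here is a minimal sketch of vanilla LSH (cosine-style, via
random hyperplanes) that makes the two parameters explicit; the names and
sizes are illustrative only, not taken from any of the libraries discussed:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)
k, L = 8, 4            # signature length and number of tables: the hard-to-tune knobs
dim, n = 16, 500

X = rng.normal(size=(n, dim))
planes = rng.normal(size=(L, k, dim))  # one set of k random hyperplanes per table

def signature(x, table):
    """k-bit signature of x in the given table, usable as a dict key."""
    bits = (planes[table] @ x) > 0
    return bits.tobytes()

# L hash tables, each mapping a k-bit signature to the indices in that bucket.
tables = [defaultdict(list) for _ in range(L)]
for idx, x in enumerate(X):
    for t in range(L):
        tables[t][signature(x, t)].append(idx)

def query(q):
    """Union of candidates across the L tables. The candidate count is
    uncontrolled: too large if k is small, often empty if k is large."""
    cands = set()
    for t in range(L):
        cands.update(tables[t].get(signature(q, t), []))
    return cands

print(len(query(X[0])))  # candidate count for one query (includes index 0 itself)
```

The memory cost grows with L (each table stores every point), and the
candidate-set size swings with k, which is exactly the tuning problem above.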
Robert, you have experience with the standard LSH algorithm, right? Can you
suggest any improvements that might come in handy while prototyping?
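On Daniel's sorting + binary search idea for LSH-Forest (quoted below), here
is how I currently picture it, purely as my own toy sketch of the
prefix-shortening lookup, not his proposed implementation:

```python
import bisect
import numpy as np

rng = np.random.default_rng(1)
dim, n, max_bits = 8, 200, 16

X = rng.normal(size=(n, dim))
planes = rng.normal(size=(max_bits, dim))

def key(x):
    """Hash as a bit string; LSH-Forest shortens the matched prefix
    until enough candidates are found."""
    return ''.join('1' if b else '0' for b in (planes @ x) > 0)

# One "tree" = the sorted list of keys; binary search on it replaces an
# explicit trie / range-query structure.
keys = sorted((key(x), i) for i, x in enumerate(X))
sorted_keys = [k for k, _ in keys]

def candidates(q, min_candidates=10):
    qk = key(q)
    for prefix_len in range(max_bits, 0, -1):   # shrink prefix until enough matches
        prefix = qk[:prefix_len]
        lo = bisect.bisect_left(sorted_keys, prefix)
        hi = bisect.bisect_right(sorted_keys, prefix + '2')  # '2' > '1', so this bounds the prefix range
        if hi - lo >= min_candidates:
            return [keys[i][1] for i in range(lo, hi)]
    return [i for _, i in keys]  # degenerate fall-back: everything

print(len(candidates(X[0])))
```

The appeal is the controlled candidate count: instead of a fixed k, the
prefix length adapts per query, and only sorted arrays plus bisect are needed.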
Best Regards,
Maheshakya.
On Sat, Mar 15, 2014 at 3:06 AM, Maheshakya Wijewardena <
[email protected]> wrote:
> Isn't the algorithm used in Annoy somewhat similar to LSH-Forest? In
> Annoy, a binary tree is built using the result of a random projection on
> the training data points at each node. (Random projection is itself a
> method for doing LSH.) The main difference I see is the data-driven nature
> of the random projection in Annoy (only when the AnnoyIndex metric is
> 'euclidean').
>
> Maheshakya
>
>
> On Sat, Mar 15, 2014 at 12:10 AM, Daniel Vainsencher <
> [email protected]> wrote:
>
>> Hi Olivier,
>>
>> Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
>> paper? I've used it in practice, and it makes a significant improvement
>> vs. basic LSH. I've proposed in recent emails to the list how to
>> implement it based on sorting and binary search (in order to avoid the
>> dependency/maintenance burden of range-queries).
>>
>> I think evaluating and tweaking vanilla LSH is a waste of time when an
>> obvious improvement is available that fixes a significant real-world flaw
>> in the method (control over the number of candidates seen).
>>
>> Daniel
>>
>> On 03/14/2014 11:38 AM, Olivier Grisel wrote:
>> > 2014-03-13 23:15 GMT+01:00 Robert Layton <[email protected]>:
>> >> Thanks Gael. My thinking was to implement "Basic LSH with basic data
>> >> structures" and then spend some of the time working on seeing if
>> moderate
>> >> improvements (i.e. a more complex data structure) can deliver
>> benefits. This
>> >> way, we get the key deliverable, and spend some time trying to see if
>> we can
>> >> do better.
>> >>
>> >> I'd also like to see scalability added to the evaluation criteria!
>> >
>> > The problem is that, having had a look at the ANN literature in the
>> > past, my gut feeling is that basic LSH is mostly worthless for
>> > practical applications. I would rather not have a method in
>> > scikit-learn that does not perform well enough (in terms of speed-up
>> > vs. exact brute-force queries) to be useful on real data.
>> >
>> > To decide, I think the best way to proceed would be to have an
>> > evaluation / prototyping stage in the project: first hack an
>> > implementation of the basic methods (at least random projections
>> > and maybe others) in a gist outside of the scikit-learn codebase,
>> > without worrying about documentation and compliance with the API,
>> > then benchmark / profile it on representative datasets with various
>> > statistical structures and compare the outcome to alternative methods
>> > such as those implemented in FLANN and Annoy.
>> >
>> > Actually, there exist several Python implementations of vanilla LSH:
>> >
>> > https://pypi.python.org/pypi/lshash
>> > http://nearpy.io/
>> > https://github.com/yahoo/Optimal-LSH/blob/master/lsh.py
>> >
>> > It would be interesting to benchmark them all on the same datasets
>> > (with various dimensionality, ranks and sparsity patterns).
>> >
>> > Note that Radim already had a look at the Yahoo LSH implementation
>> > and found it impractical to use:
>> >
>> >
>> http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/
>> >
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and their
>> applications. Written by three acclaimed leaders in the field,
>> this first edition is now available. Download your free book today!
>> http://p.sf.net/sfu/13534_NeoTech
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
>
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka