On 03/16/2014 08:59 PM, Olivier Grisel wrote:
> 2014-03-14 19:40 GMT+01:00 Daniel Vainsencher <[email protected]>:
>> Hi Olivier,
>>
>> Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
>> paper? I've used it in practice, and it makes a significant improvement
>> vs. basic LSH. I've proposed in recent emails to the list how to
>> implement it based on sorting and binary search (in order to avoid the
>> dependency/maintenance burden of range-queries).
>
> No I did not read those papers (yet).
I don't know the culture here, but since you seemed to summarize rather 
than wait for someone to read up on LSH:
- Where LSH uses hash-maps keyed on the a random hash of length exactly 
k, LSHF uses an ordered map (binary tree or BTree in the paper). This 
enables queries of form:
"all keys whose prefix of length h<=k matches this hash". This trivially 
enables a per tree query of type "give me the best q hashes (and maybe a 
few more)" by doing queries with h=k, h=k-1,... until satisfied (this 
can be optimized in different ways).
- Then queries across multiple trees can be combined in different ways, 
the simplest being "give me at least q/L candidates from each tree, take 
the union". More complex variants don't seem to make much difference.
- So, the parameter k becomes essentially per-query, L can be determined 
on its own, and we can determine the number of expensive "real distance" 
comparisons we want to pay per query.

I offer the simple observation that sorted arrays support very fast 
range queries, so that complex/expensive trees are not necessary for an 
LSHF implementation.

As Maheshakya said it might very
> well be the method implemented in Annoy.
It seems similar but distinct, see a recent mail.

> Do you have sample python
> code around?
No, I implemented it in C#, and source is not available.

>> I think evaluating and tweaking vanilla LSH is a waste of time, when an
>> obvious improvement that fixes a significant real world flaw (control of
>> number of candidates seen) in the method is available.
> If the obvious improvement does not come with a significant increment
> in code complexity (and maintenance burden) then I agree, yes.
So, was the above was sufficient to answer your prerequisite?

>Getting rid of hyperparameters to select manually of via grid-search
> is a big usability improvement and can make the method much more
> practical to use.
In this case, performance becomes better than mere tuning would allow, 
since it adapts to database density around the query point.

Daniel

------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to