Hi Daniel,
Thank you for pointing this out. I have also noticed that storing hash
tables of the training data and dealing with binary vectors can be
problematic when using plain LSH-based ANN.
However, before implementing LSH Forests, I need to get the approval of the
community. I look forward to feedback on this.
Thank you,
Maheshakya.
On Thu, Mar 6, 2014 at 2:34 PM, Daniel Vainsencher <
[email protected]> wrote:
> Hi Maheshakya,
>
> In regular LSH, a particular setting of the number of hash functions per
> index (k) and the number of indexes (L) essentially determines the size of
> the region in space from which candidates will be chosen in response to a
> query. If queries q1 and q2 fall in regions where the density of data
> points is very different, then one will receive many more candidates than
> the other, and will thus be slower but more accurate. Tuning k and L
> therefore involves a tradeoff, and with varied density you cannot win.
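To make the k, L tradeoff concrete, here is a minimal sketch of plain LSH with random-hyperplane hashes; all names, sizes and the hash family are illustrative, not scikit-learn code:

```python
import numpy as np

rng = np.random.default_rng(0)

k, L = 8, 4                           # bits per index, number of indexes
dim = 16
X = rng.standard_normal((1000, dim))  # toy data set

# One random-hyperplane hash family per index: k hyperplanes each.
planes = [rng.standard_normal((k, dim)) for _ in range(L)]

def hash_key(planes_i, x):
    # k-bit key: sign pattern of x against the k hyperplanes.
    return tuple((planes_i @ x > 0).astype(int))

# Build the L hash tables, mapping key -> list of point indices.
tables = [dict() for _ in range(L)]
for i, x in enumerate(X):
    for t, P in zip(tables, planes):
        t.setdefault(hash_key(P, x), []).append(i)

def candidates(q):
    # Union of the L buckets the query falls into; bucket sizes
    # (hence cost and accuracy) depend on the local data density.
    out = set()
    for t, P in zip(tables, planes):
        out.update(t.get(hash_key(P, q), []))
    return out

# A stored point always lands in its own buckets.
cands = candidates(X[0])
```

Fixed k and L pick the bucket granularity once, globally, which is exactly why dense and sparse regions cannot both be served well.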
>
> Even worse: as you add data, your k, L stop being optimal. So tuning on a
> significantly smaller sample is not effective, which makes tuning even more
> expensive.
>
> Going back to LSH Forest: an implementation using binary trees is
> possible and fine for testing, but not very space- or cache-efficient.
> Other data structures (B-trees and variants; see [2] for some discussion)
> are more appropriate. The key operations are range queries and, depending
> on the rate of data change, insertion.
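As a toy illustration of the range-query view (illustrative bit-string keys, not the paper's data structure): keeping an index's hash keys sorted makes every prefix a contiguous range, and a length-bounded lookup just shortens the prefix until the range holds enough candidates:

```python
import bisect

# Toy index: hash keys stored as bit strings, kept sorted so that all
# points sharing a prefix occupy one contiguous range.
index = sorted([
    ("0010", 0), ("0011", 1), ("0100", 2),
    ("0101", 3), ("1100", 4), ("1101", 5),
])
keys = [k for k, _ in index]

def prefix_range(prefix):
    # Contiguous slice of point ids whose key starts with `prefix`.
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "2")  # "2" sorts past any "0"/"1"
    return [pid for _, pid in index[lo:hi]]

def query(key, min_candidates):
    # Shorten the prefix until enough candidates fall in the range:
    # the length-bounded lookup of LSH Forest.
    for length in range(len(key), -1, -1):
        cands = prefix_range(key[:length])
        if len(cands) >= min_candidates:
            return cands
    return cands
```

The same sorted-key layout is what makes B-tree style structures a natural fit: both operations above are ordinary ordered-range scans.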
>
> Space efficiency of the indexes is also affected by the hash type: if you
> use binary hash functions, choose k in {32, 64} and use the corresponding
> integer types to avoid numpy's awkwardness around bit vectors. LSH Forest
> has no difficulty with such a high k in the index (as opposed to plain LSH).
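For example, with k = 32 the whole binary hash fits in a single uint32; a hypothetical sketch using numpy's packbits (the hash family is again illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k, dim = 32, 16                      # k in {32, 64} packs into one integer
planes = rng.standard_normal((k, dim))

def packed_hash(x):
    # Pack the k sign bits into one uint32 instead of a numpy bool
    # vector, so keys are cheap to store, compare and sort.
    bits = (planes @ x > 0).astype(np.uint8)
    return np.packbits(bits).view(np.uint32)[0]

h = packed_hash(rng.standard_normal(dim))
```

Integer keys also make prefix comparisons a matter of shifts and masks rather than per-element array operations.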
>
> Daniel Vainsencher
>
>
> On 03/05/2014 07:18 AM, Maheshakya Wijewardena wrote:
>
> Hi Daniel,
> These alternative LSH strategies are a good way to build on the existing
> implementations in scikit-learn. But why do you say the standard
> algorithms are impractical?
>
> Maheshakya
>
>
> On Wed, Mar 5, 2014 at 1:31 AM, Daniel Vainsencher <
> [email protected]> wrote:
>
>> Hi scikit-learn, Maheshakya,
>>
>> In my experience implementing and using LSH, the vanilla algorithms are
>> very impractical. Two extensions that help a lot (and combine nicely) are:
>> - Multi-probe LSH [1], where hash values neighboring the target's hash
>> are also used.
>> - LSH Forest [2], where fixed-length hashes are replaced by length-bounded
>> hashes; the hash used is just long enough to yield sufficiently few
>> candidates for the target's nearest neighbor.
>>
>> Together they make it feasible to limit the number of independent copies
>> of the index to a reasonable number. Incidentally, the LSH Forest approach
>> may be implemented using binary trees.
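A binary-tree version of that lookup can be sketched in a few lines; this is a purely illustrative nested-dict trie over bit-string hash keys:

```python
# Minimal prefix tree (trie) over bit-string hashes, illustrating the
# binary-tree implementation of LSH Forest mentioned above.
def insert(tree, key, point_id):
    node = tree
    for bit in key:
        node = node.setdefault(bit, {})
    node.setdefault("ids", []).append(point_id)

def collect(node):
    # All point ids stored in this subtree.
    out = list(node.get("ids", []))
    for bit in ("0", "1"):
        if bit in node:
            out += collect(node[bit])
    return out

def query(tree, key):
    # Descend while the next bit matches, then return the subtree under
    # the deepest matching node: the length-bounded hash lookup.
    node = tree
    for bit in key:
        if bit not in node:
            break
        node = node[bit]
    return collect(node)

tree = {}
for i, key in enumerate(["0010", "0011", "0100", "1101"]):
    insert(tree, key, i)
```

Descending only as far as the data allows is what removes the need to tune a fixed hash length k per index.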
>>
>> Daniel Vainsencher
>>
>> [1] www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf
>> [2] www2005.org/docs/p651.pdf
>>
>>
>> On 03/04/2014 06:09 PM, Maheshakya Wijewardena wrote:
>>
>> Hi,
>>
>> Currently, unsupervised neighbor search is implemented in
>> sklearn.neighbors. There are three algorithms used to perform neighbor
>> search.
>>
>> - kd tree
>> - ball tree
>> - brute force method
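For reference, all three are already exposed through the `algorithm` parameter of `sklearn.neighbors.NearestNeighbors`, which is presumably where an approximate LSH-based option would eventually sit:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

# The three exact algorithms listed above return the same neighbors;
# an LSH-based estimator would slot in as an approximate fourth option.
for algorithm in ("kd_tree", "ball_tree", "brute"):
    nn = NearestNeighbors(n_neighbors=2, algorithm=algorithm).fit(X)
    dist, ind = nn.kneighbors([[0.1, 0.1]])
```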
>>
>> kd tree and ball tree are implemented separately in Cython. Both of them
>> use a binary tree (binary_tree.pxi). These implementations are used in
>> the NeighborsBase class.
>>
>> As Locality Sensitive Hashing will also be used to approximate nearest
>> neighbors, its implementation should also be part of sklearn.neighbors.
>> But locality sensitive hashing algorithms do not involve the binary tree
>> implementation, so LSH has to be implemented separately and used in
>> sklearn.neighbors as another algorithm. Is this approach correct?
>>
>> I'm trying to implement a base LSH class and the hashing algorithms in
>> Cython. To approximate neighbor search with these algorithms, multiple
>> hash tables will be created from the hash functions, so an efficient way
>> to store them is required.
>>
>> These are some of the issues I ran into while trying to implement LSH
>> for approximate neighbor search. Is this an appropriate approach, or are
>> there better ways to implement this?
>>
>>
>>
>> Regards,
>> Maheshakya,
>> --
>> Undergraduate,
>> Department of Computer Science and Engineering,
>> Faculty of Engineering.
>> University of Moratuwa,
>> Sri Lanka
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce.
>> With Perforce, you get hassle-free workflows. Merge that actually works.
>> Faster operations. Version large binaries. Built-in WAN optimization and the
>> freedom to use Git, Perforce or both. Make the move to Perforce.
>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>>
>
>
> --
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering.
> University of Moratuwa,
> Sri Lanka
>
>
>
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka