Hi Daniel,
Thank you for pointing this out. I have also noticed that storing hash
tables of the training data and dealing with binary vectors can be
problematic when using plain LSH-based ANN.
However, before implementing LSH Forests, I need to get the approval of the
community. I look forward to feedback on this.
Thank you,
Maheshakya.
On Thu, Mar 6, 2014 at 2:34 PM, Daniel Vainsencher <
[email protected]> wrote:
> Hi Maheshakya,
>
> In regular LSH, a particular setting of the number of hash functions per
> index (k) and the number of indexes (L) essentially determines the size of
> the region in space from which candidates will be chosen in response to a
> query. If queries q1 and q2 fall in regions where the density of data
> points is very different, then one will receive many more candidates than
> the other, and will thus be slower but more accurate. Tuning k and L
> therefore involves a tradeoff, and with varied density you cannot win.
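To make the k, L tradeoff concrete, here is a minimal sketch of plain LSH with random-hyperplane hashes; all names, sizes and the hash family are illustrative, not scikit-learn code:

```python
import numpy as np

rng = np.random.default_rng(0)

k, L = 8, 4                           # bits per index, number of indexes
dim = 16
X = rng.standard_normal((1000, dim))  # toy data set

# One random-hyperplane hash family per index: k hyperplanes each.
planes = [rng.standard_normal((k, dim)) for _ in range(L)]

def hash_key(planes_i, x):
    # k-bit key: sign pattern of x against the k hyperplanes.
    return tuple((planes_i @ x > 0).astype(int))

# Build the L hash tables, mapping key -> list of point indices.
tables = [dict() for _ in range(L)]
for i, x in enumerate(X):
    for t, P in zip(tables, planes):
        t.setdefault(hash_key(P, x), []).append(i)

def candidates(q):
    # Union of the L buckets the query falls into; bucket sizes
    # (hence cost and accuracy) depend on the local data density.
    out = set()
    for t, P in zip(tables, planes):
        out.update(t.get(hash_key(P, q), []))
    return out

# A stored point always lands in its own buckets.
cands = candidates(X[0])
```

Fixed k and L pick the bucket granularity once, globally, which is exactly why dense and sparse regions cannot both be served well.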
>
> Even worse: as you add data, your k, L stop being optimal. So tuning on a
> significantly smaller sample is not effective, which makes tuning even more
> expensive.
>
> Going back to LSH Forest: an implementation using binary trees is
> possible and fine for testing, but not very space- or cache-efficient.
> Other data structures (B-trees and variants; see [2] for some discussion)
> are more appropriate. The key operations are range queries and, depending
> on the rate of data change, insertion.
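As a toy illustration of the range-query view (illustrative bit-string keys, not the paper's data structure): keeping an index's hash keys sorted makes every prefix a contiguous range, and a length-bounded lookup just shortens the prefix until the range holds enough candidates:

```python
import bisect

# Toy index: hash keys stored as bit strings, kept sorted so that all
# points sharing a prefix occupy one contiguous range.
index = sorted([
    ("0010", 0), ("0011", 1), ("0100", 2),
    ("0101", 3), ("1100", 4), ("1101", 5),
])
keys = [k for k, _ in index]

def prefix_range(prefix):
    # Contiguous slice of point ids whose key starts with `prefix`.
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "2")  # "2" sorts past any "0"/"1"
    return [pid for _, pid in index[lo:hi]]

def query(key, min_candidates):
    # Shorten the prefix until enough candidates fall in the range:
    # the length-bounded lookup of LSH Forest.
    for length in range(len(key), -1, -1):
        cands = prefix_range(key[:length])
        if len(cands) >= min_candidates:
            return cands
    return cands
```

The same sorted-key layout is what makes B-tree style structures a natural fit: both operations above are ordinary ordered-range scans.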
>
> Space efficiency of the indexes is also affected by the hash type: if you
> use binary hash functions, choose k in {32, 64} and use the corresponding
> integer types to avoid numpy's awkwardness around bit vectors. LSH Forest
> has no difficulty with such a high k in the index (as opposed to plain LSH).
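For example, with k = 32 the whole binary hash fits in a single uint32; a hypothetical sketch using numpy's packbits (the hash family is again illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k, dim = 32, 16                      # k in {32, 64} packs into one integer
planes = rng.standard_normal((k, dim))

def packed_hash(x):
    # Pack the k sign bits into one uint32 instead of a numpy bool
    # vector, so keys are cheap to store, compare and sort.
    bits = (planes @ x > 0).astype(np.uint8)
    return np.packbits(bits).view(np.uint32)[0]

h = packed_hash(rng.standard_normal(dim))
```

Integer keys also make prefix comparisons a matter of shifts and masks rather than per-element array operations.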
>
> Daniel Vainsencher
>
>
> On 03/05/2014 07:18 AM, Maheshakya Wijewardena wrote:
>
> Hi Daniel,
> These alternative LSH strategies are a good way to build on the existing
> implementations in scikit-learn. But why do you say the standard
> algorithms are impractical?
>
> Maheshakya
>
>
> On Wed, Mar 5, 2014 at 1:31 AM, Daniel Vainsencher <
> [email protected]> wrote:
>
>> Hi scikit-learn, Maheshakya,
>>
>> In my experience implementing and using LSH, the vanilla algorithms are
>> very impractical. Two extensions that help a lot (and combine nicely) are:
>> - Multi-probe LSH [1], where hash values neighboring the target's hash
>> are also used.
>> - LSH Forest [2], where fixed-length hashes are replaced by length-bounded
>> hashes; the hash used is just long enough to yield sufficiently few
>> candidates for the target's nearest neighbor.
>>
>> Together they make it feasible to limit the number of independent copies
>> of the index to a reasonable number. Incidentally, the LSH Forest approach
>> may be implemented using binary trees.
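A binary-tree version of that lookup can be sketched in a few lines; this is a purely illustrative nested-dict trie over bit-string hash keys:

```python
# Minimal prefix tree (trie) over bit-string hashes, illustrating the
# binary-tree implementation of LSH Forest mentioned above.
def insert(tree, key, point_id):
    node = tree
    for bit in key:
        node = node.setdefault(bit, {})
    node.setdefault("ids", []).append(point_id)

def collect(node):
    # All point ids stored in this subtree.
    out = list(node.get("ids", []))
    for bit in ("0", "1"):
        if bit in node:
            out += collect(node[bit])
    return out

def query(tree, key):
    # Descend while the next bit matches, then return the subtree under
    # the deepest matching node: the length-bounded hash lookup.
    node = tree
    for bit in key:
        if bit not in node:
            break
        node = node[bit]
    return collect(node)

tree = {}
for i, key in enumerate(["0010", "0011", "0100", "1101"]):
    insert(tree, key, i)
```

Descending only as far as the data allows is what removes the need to tune a fixed hash length k per index.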
>>
>> Daniel Vainsencher
>>
>> [1] www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf
>> [2] www2005.org/docs/p651.pdf
>>
>>
>> On 03/04/2014 06:09 PM, Maheshakya Wijewardena wrote:
>>
>> Hi,
>>
>> Currently, unsupervised neighbor search is implemented in
>> sklearn.neighbors. There are three algorithms used to perform neighbor
>> search.
>>
>> - kd tree
>> - ball tree
>> - brute force method
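For reference, all three are already exposed through the `algorithm` parameter of `sklearn.neighbors.NearestNeighbors`, which is presumably where an approximate LSH-based option would eventually sit:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

# The three exact algorithms listed above return the same neighbors;
# an LSH-based estimator would slot in as an approximate fourth option.
for algorithm in ("kd_tree", "ball_tree", "brute"):
    nn = NearestNeighbors(n_neighbors=2, algorithm=algorithm).fit(X)
    dist, ind = nn.kneighbors([[0.1, 0.1]])
```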
>>
>> kd tree and ball tree are implemented separately in Cython. Both of them
>> use a binary tree (binary_tree.pxi). These implementations are used in
>> the NeighborsBase class.
>>
>> As Locality Sensitive Hashing will also be used to approximate nearest
>> neighbors, its implementation should also be part of sklearn.neighbors.
>> But locality sensitive hashing algorithms do not involve the binary tree
>> implementation, so LSH has to be implemented separately and used in
>> sklearn.neighbors as another algorithm. Is this approach correct?
>>
>> I'm trying to implement a base LSH class and the hashing algorithms in
>> Cython. To approximate neighbor search with these algorithms, multiple
>> hash tables will be created from the hash functions, so an efficient way
>> to store them is required.
>>
>> These are some of the issues I ran into while trying to implement LSH
>> for approximate neighbor search. Is this an appropriate approach, or are
>> there better ways to implement this?
>>
>>
>>
>> Regards,
>> Maheshakya,
>> --
>> Undergraduate,
>> Department of Computer Science and Engineering,
>> Faculty of Engineering.
>> University of Moratuwa,
>> Sri Lanka
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce.
>> With Perforce, you get hassle-free workflows. Merge that actually works.
>> Faster operations. Version large binaries. Built-in WAN optimization and the
>> freedom to use Git, Perforce or both. Make the move to Perforce.
>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>>
>
>
> --
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering.
> University of Moratuwa,
> Sri Lanka
>
>
>
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka