Hi Maheshakya,
In regular LSH, a particular setting of the number of hash functions per index (k) and the number of indexes (L) essentially determines the size of the region of space from which candidates are drawn in response to a query. If queries q1 and q2 fall in areas where the density of data points is very different, then one will receive many more candidates than the other, and so be slower but more accurate. Tuning k and L is therefore a tradeoff, and with varied density you cannot win. Even worse: as you add data, your k and L stop being optimal, so tuning on a significantly smaller sample is not effective, which makes tuning even more expensive.

Going back to LSH Forest: an implementation using binary trees is possible and fine for testing, but not very space- or cache-efficient. Other data structures (b-trees and variants; see [2] for some discussion) are more appropriate. The key operations are range queries and, depending on the rate of data change, insertion.

Space efficiency of the indexes is also affected by the hash type: if you use binary hash functions, choose k in {32, 64} and use the corresponding integer types, to avoid numpy's awkwardness around bit vectors. Unlike plain LSH, LSH Forest has no difficulty with such a high k in the index.

Daniel Vainsencher

On 03/05/2014 07:18 AM, Maheshakya Wijewardena wrote:
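The density problem can be seen in a small simulation. This is only a sketch (random-hyperplane hashing; the cluster shapes and the k, L values are made up for illustration, not scikit-learn code): with one fixed (k, L), a query inside an angularly dense cluster collides with far more points than a query among scattered points.

```python
import numpy as np

rng = np.random.RandomState(42)
d, k, L = 10, 8, 4

# A tight (angularly dense) cluster around one direction, plus scattered points.
mu = np.zeros(d)
mu[0] = 5.0
dense = rng.randn(2000, d) * 0.2 + mu
sparse = rng.randn(200, d)
X = np.vstack([dense, sparse])

weights = 1 << np.arange(k)          # bit weights turning k hash bits into one key

# L independent hash tables, each built from k random hyperplanes.
tables = []
for _ in range(L):
    planes = rng.randn(k, d)
    keys = ((X @ planes.T) > 0).astype(np.int64) @ weights
    tables.append((planes, keys))

def candidates(q):
    """Indices of points colliding with q in at least one of the L tables."""
    hit = np.zeros(len(X), dtype=bool)
    for planes, keys in tables:
        qkey = ((q @ planes.T) > 0).astype(np.int64) @ weights
        hit |= keys == qkey
    return np.flatnonzero(hit)

n_dense = len(candidates(dense[0]))    # query inside the dense cluster
n_sparse = len(candidates(sparse[0]))  # query among the scattered points
print(n_dense, n_sparse)
```

Whatever (k, L) you pick, the dense-region query pays in time and the sparse-region query pays in recall; shifting (k, L) just moves the pain from one to the other.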
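On the integer-key point, here is a minimal sketch of what I mean (the hash family and names are my own, not a proposed API): pack the k = 64 binary hash bits of each point into a single np.uint64 key. On the sorted keys, the LSH Forest operation "all points agreeing with the query on a length-p prefix" is exactly a range query.

```python
import numpy as np

rng = np.random.RandomState(0)
k, d, n = 64, 20, 1000
X = rng.randn(n, d)
planes = rng.randn(k, d)                 # k random hyperplanes

# Pack the k boolean hash bits of each point into one np.uint64 key,
# instead of keeping numpy bool/bit vectors around.
bits = (X @ planes.T) > 0                # shape (n, k)
weights = np.uint64(1) << np.arange(k, dtype=np.uint64)
keys = bits.astype(np.uint64) @ weights  # one 64-bit key per point
keys_sorted = np.sort(keys)

def prefix_candidates(qkey, p):
    """Keys sharing their top-p bits with qkey (1 <= p <= k).

    On the sorted key array these form one contiguous run, so a b-tree
    or sorted-array index answers this with a range query; the O(n)
    comparison below is only to keep the sketch short.
    """
    shift = np.uint64(k - p)
    return keys_sorted[(keys_sorted >> shift) == (qkey >> shift)]
```

Shortening the prefix p widens the candidate set, which is how LSH Forest adapts the effective hash length per query instead of committing to one global k.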
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
