As a starter, I was thinking of implementing the plain LSH algorithm below.
But as Daniel Vainsencher has shown (in the other mailing thread), we can
use LSH Forest.

> In regular LSH, a particular setting of the number of hash functions per
> index (k) and the number of indexes (L) essentially determines the size of
> the region in space from which candidates will be chosen in response to a
> query. If queries q1 and q2 are in areas where the density of data points
> is very different, then one will receive many more candidates than the
> other, thus be slower and more accurate. Thus tuning k, L involves a
> tradeoff, and with varied density, you cannot win.
> Even worse: as you add data, your k, L stop being optimal. So tuning on a
> significantly smaller sample is not effective, which makes tuning even more
> expensive.
> Going back to LSH Forest: an implementation using binary trees is possible
> and OK for testing, but not very space- or cache-efficient. Other data
> structures are more appropriate. The key operations are range queries and,
> depending on the data change rate, insertion.
> Space efficiency of indexes is also affected by the hash type: if you use
> binary hash functions, choose k in {32, 64} and use the appropriate
> integer types to avoid numpy's awkwardness around bit vectors. LSH Forest
> has no difficulty with such high k in the index (as opposed to plain LSH).
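Daniel's point about integer types can be sketched roughly as follows. This is only an illustration, not part of his proposal: the random-hyperplane sign hash, the data shapes, and k = 64 are all assumptions chosen to show how k sign bits pack into a single unsigned 64-bit integer per point.

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(100, 20)            # 100 points in 20 dimensions (illustrative)

k = 64                            # hash length, one of the suggested values
planes = rng.randn(20, k)         # random hyperplanes defining the sign hash

bits = (X @ planes) > 0           # (100, 64) boolean sign bits

# Pack each row of 64 bits into one uint64, avoiding numpy bit-vector
# awkwardness: each bit contributes its power-of-two weight to the key.
weights = np.uint64(1) << np.arange(k, dtype=np.uint64)
hashes = bits.astype(np.uint64) @ weights   # shape (100,), dtype uint64
```

Each index entry is then a single machine word rather than a 64-element bit array, which is what makes the high-k regime cheap to store.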


LSH Forest is described in this paper: LSH Forest: Self-Tuning Indexes for
Similarity
Search<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.163.4596>.
It has a reasonable number of citations as well.

So rather than implementing the plain LSH ANN algorithm, what if we go for
this one? In my opinion, it can use the existing modules in scikit-learn as
well: we can keep the hashing algorithms the same, and only the structure
that stores the transformed/indexed training data has to be changed.
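To make the reuse concrete, here is a minimal sketch of deriving hash bits from scikit-learn's existing GaussianRandomProjection transformer; the data shapes and number of components are illustrative assumptions, not a proposed API.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.randn(50, 30)            # illustrative data: 50 points, 30 features

# Reuse the existing random projection module as the hashing step;
# only the sign of each projected coordinate is kept as a hash bit.
grp = GaussianRandomProjection(n_components=16, random_state=0)
projected = grp.fit_transform(X)
hash_bits = projected > 0        # (50, 16) boolean hash codes
```

Only the index structure that stores `hash_bits` would be new; the projection itself is unchanged scikit-learn code.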
What are your opinions on this?

Regards,
Maheshakya


On Thu, Feb 27, 2014 at 1:26 AM, Maheshakya Wijewardena <
pmaheshak...@gmail.com> wrote:

> I would also rather focus on non-binary representations.
>
>
> Even when using the random projection method for hashing, only the sign of
> the result of the dot product is considered. So in that situation there
> will also be a binary representation (of +1s and -1s). What is your idea
> about this method?
>
> Nearest neighbor search has been implemented in scikit-learn in
> sklearn.neighbors. In unsupervised.py, the NeighborsBase class is used,
> and NeighborsBase (in base.py) supports the following algorithms to
> perform the search:
>
>    - brute - a brute-force linear search
>    - kd_tree - KD tree search
>    - ball_tree - ball tree search
>
> So we can add LSH-based search as another algorithm type in
> NearestNeighbors.
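For context, this is how the existing estimator selects among those algorithms today; a hypothetical algorithm='lsh' option would slot in the same way (that option name is an assumption, not an existing parameter value).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(100, 5)            # illustrative training data

# The algorithm is chosen by name; 'lsh' would become a fourth choice.
nn = NearestNeighbors(n_neighbors=3, algorithm='ball_tree')
nn.fit(X)
distances, indices = nn.kneighbors(X[:1])   # query with the first point
```

Since the query point is in the training set, its own index comes back first with distance zero.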
>
> In order to perform neighbor search using LSH, those hashing methods
> should be implemented separately (in another file). There will be multiple
> hash tables built by concatenating hash functions. Here, I notice an issue:
> as we generate a significantly large number of hash tables, there must be
> a way to store them efficiently. Is there a way to do this in the
> scikit-learn way? This part will also have to be implemented outside the
> NeighborsBase class.
> The logic for performing the search using the computed hash tables
> should be included in NeighborsBase.
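One way the multi-table storage might look, purely as an illustration: L plain Python dicts keyed by an integer hash, with candidate retrieval taking the union of matching buckets. All names, shapes, and the choice of L and k here are hypothetical.

```python
import numpy as np
from collections import defaultdict

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
L, k = 5, 8                      # number of tables and bits per table (illustrative)

tables, planes = [], []
for _ in range(L):
    P = rng.randn(10, k)                          # hyperplanes for this table
    keys = (X @ P > 0) @ (1 << np.arange(k))      # integer bucket key per point
    table = defaultdict(list)
    for i, key in enumerate(keys):
        table[int(key)].append(i)                 # bucket -> row indices
    tables.append(table)
    planes.append(P)

def candidates(q):
    """Union of bucket members across all L tables for query point q."""
    out = set()
    for P, table in zip(planes, tables):
        key = int((q @ P > 0) @ (1 << np.arange(k)))
        out.update(table.get(key, []))
    return out

cands = candidates(X[0])   # X[0] lands in its own bucket in every table
```

A production index would likely replace the dicts with sorted arrays or another cache-friendly layout, as discussed above, but the bucket-union logic stays the same.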
>
> This is my basic opinion on how to implement LSH based neighbor search in
> Scikit-learn. Your feedback and suggestions for improvements are welcome.
>
> Regards,
> Maheshakya.
>
>
> On Thu, Feb 27, 2014 at 12:28 AM, Andy <t3k...@gmail.com> wrote:
>
>>  On 02/26/2014 10:13 AM, Maheshakya Wijewardena wrote:
>>
>> The method "bit sampling for Hamming distance" is already covered by the
>> "brute" algorithm with the metric "hamming" in nearest neighbor search.
>> Hence, I think that does not need to be implemented as an LSH algorithm.
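For reference, the existing interface already supports this combination directly; the binary codes below are illustrative data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
codes = rng.randint(0, 2, size=(50, 16))   # illustrative 16-bit binary codes

# Brute-force search under the Hamming metric (fraction of differing bits).
nn = NearestNeighbors(n_neighbors=3, algorithm='brute', metric='hamming')
nn.fit(codes)
dist, ind = nn.kneighbors(codes[:1])       # query with the first code
```

The nearest distance is zero because the query code is itself in the index.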
>>
>> I would also rather focus on non-binary representations.
>> There is no efficient way to work with binary data in numpy afaik -- at
>> least none that is supported in sklearn.
>>
>> I'm very interested in this project but unfortunately I don't have the
>> time to mentor.
>>
>> Cheers,
>> Andy
>>
>>
>> On Wed, Feb 26, 2014 at 12:46 AM, Maheshakya Wijewardena <
>> pmaheshak...@gmail.com> wrote:
>>
>>> Approximate nearest neighbor search is one of the applications of
>>> locality sensitive hashing. There are five major methods:
>>>
>>>    - Bit sampling for Hamming distance
>>>    - Min-wise independent permutations
>>>    - Nilsimsa Hash
>>>    - Random projection
>>>    - Stable distributions
>>>
>>> The bit sampling method is fairly straightforward. A reference
>>> implementation of the random projection method can be found in the *lshash
>>> <https://pypi.python.org/pypi/lshash>* library.
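A minimal sketch of bit sampling, which is indeed straightforward: the hash of a point is just its values at a few randomly chosen bit positions, an LSH family for Hamming distance. The data and the number of sampled positions are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
points = rng.randint(0, 2, size=(100, 32))   # data as 32-bit binary vectors

# Choose k distinct bit positions once; the hash reads those positions.
k = 6
positions = rng.choice(32, size=k, replace=False)

def bit_sample_hash(v):
    """Hash a binary vector by sampling k fixed bit positions."""
    return tuple(v[positions])

h = bit_sample_hash(points[0])
```

Points that are close in Hamming distance agree on most bits, so they collide on the sampled positions with high probability.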
>>> I'm looking forward to seeing comments on this from prospective mentors
>>> of this project.
>>>
>>>  Thank you.
>>>  Maheshakya.
>>>
>>>
>>>
>>>  On Tue, Feb 25, 2014 at 8:24 AM, Maheshakya Wijewardena <
>>> pmaheshak...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I have looked into this project idea. I have studied this method and
>>>> would like to discuss it further.
>>>> I would like to know who the mentors for this project are and to get
>>>> some insight on how to begin.
>>>>
>>>>  Regards,
>>>> Maheshakya,
>>>> --
>>>> Undergraduate,
>>>> Department of Computer Science and Engineering,
>>>> Faculty of Engineering.
>>>> University of Moratuwa,
>>>>  Sri Lanka
>>>>
>>>
>>>
>>>
>>>  --
>>> Undergraduate,
>>> Department of Computer Science and Engineering,
>>> Faculty of Engineering.
>>> University of Moratuwa,
>>>  Sri Lanka
>>>
>>
>>
>>
>>  --
>> Undergraduate,
>> Department of Computer Science and Engineering,
>> Faculty of Engineering.
>> University of Moratuwa,
>>  Sri Lanka
>>
>>
>>
>>
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>>
>>
>>
>>
>
>
> --
> Undergraduate,
> Department of Computer Science and Engineering,
> Faculty of Engineering.
> University of Moratuwa,
> Sri Lanka
>



-- 
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka

