Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Maheshakya Wijewardena Tue, 18 Mar 2014 07:48:06 -0700

Thank you Robert for your valuable feedback. I will update my proposal
accordingly asap and let you know.


However, I prepared this according to the template given by Python
organization. So I have put information about the projects I have done in
that position according to template. If scikit-learn expects it to be
otherwise, I can change it.

I have another thing to discuss here before finalizing my proposal. Mr. Wei
Dong has informed me that there is another state-of-art approach he
implemented to perform ANN search and asked to see if that can be used in
the evaluating stage. This implementation is called
KGraph<http://www.kgraph.org/index.php?n=Main.Python>.
It has comparisons with FLANN, I have started reading it. Maybe you can
have a look into that as well.

Best regards,
Maheshakya


On Tue, Mar 18, 2014 at 4:08 AM, Robert Layton <[email protected]>wrote:

> Thanks for sending this.
>
> The grammar needs updating, but I'll let you worry about proofreading.
> Comments below:
> - State some *evidence* for your experience with python, i.e. projects
> you worked on or links to big commits (this is covered late in the
> application, move some information forward).
> - Don't choose a project because "it was on the list" -- why is the
> problem important and what can *you* do to help?
> - Be more confident! "This is the most vital part of this project as I
> think." Ensure the whole project is planned and evaluated.
> - Can you talk a bit more about which hashing algorithms will be
> implemented? This is a major part of the work, but isn't appropriately
> covered.
> - In your abstract, I would make certain your plan -- you want to
> implement prototypes, evaluate them and then finish the implementations.
> Make this absolutely clear (currently, the reader has to piece this
> together for themselves)
> - References need to be done more thoroughly.
>
> I think it's a good project, now you need to sell it!
>
>
>
>
> On 18 March 2014 06:55, Maheshakya Wijewardena <[email protected]>wrote:
>
>> Hi,
>>
>> I have written a proposal for this project and it is in the wiki,
>>
>> https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search
>>
>> It is not complete yet. I have not written the timeline yet. But I have
>> described my plan for this project. I think the prototyping part of this
>> project should be started during the community bonding period, while fixing
>> other issues because that needs to be done as soon as possible. I will
>> update this with my timeline soon.
>>
>> Can you review this and give me your feedback.
>>
>> Thank you,
>> Maheshakya.
>>
>>
>> On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher <
>> [email protected]> wrote:
>>
>>> On 03/16/2014 08:59 PM, Olivier Grisel wrote:
>>> > 2014-03-14 19:40 GMT+01:00 Daniel Vainsencher <
>>> [email protected]>:
>>> >> Hi Olivier,
>>> >>
>>> >> Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
>>> >> paper? I've used it in practice, and it makes a significant
>>> improvement
>>> >> vs. basic LSH. I've proposed in recent emails to the list how to
>>> >> implement it based on sorting and binary search (in order to avoid the
>>> >> dependency/maintenance burden of range-queries).
>>> >
>>> > No I did not read those papers (yet).
>>> I don't know the culture here, but since you seemed to summarize rather
>>> than wait for someone to read up on LSH:
>>> - Where LSH uses hash-maps keyed on the a random hash of length exactly
>>> k, LSHF uses an ordered map (binary tree or BTree in the paper). This
>>> enables queries of form:
>>> "all keys whose prefix of length h<=k matches this hash". This trivially
>>> enables a per tree query of type "give me the best q hashes (and maybe a
>>> few more)" by doing queries with h=k, h=k-1,... until satisfied (this
>>> can be optimized in different ways).
>>> - Then queries across multiple trees can be combined in different ways,
>>> the simplest being "give me at least q/L candidates from each tree, take
>>> the union". More complex variants don't seem to make much difference.
>>> - So, the parameter k becomes essentially per-query, L can be determined
>>> on its own, and we can determine the number of expensive "real distance"
>>> comparisons we want to pay per query.
>>>
>>> I offer the simple observation that sorted arrays support very fast
>>> range queries, so that complex/expensive trees are not necessary for an
>>> LSHF implementation.
>>>
>>> As Maheshakya said it might very
>>> > well be the method implemented in Annoy.
>>> It seems similar but distinct, see a recent mail.
>>>
>>> > Do you have sample python
>>> > code around?
>>> No, I implemented it in C#, and source is not available.
>>>
>>> >> I think evaluating and tweaking vanilla LSH is a waste of time, when
>>> an
>>> >> obvious improvement that fixes a significant real world flaw (control
>>> of
>>> >> number of candidates seen) in the method is available.
>>> > If the obvious improvement does not come with a significant increment
>>> > in code complexity (and maintenance burden) then I agree, yes.
>>> So, was the above was sufficient to answer your prerequisite?
>>>
>>> >Getting rid of hyperparameters to select manually of via grid-search
>>> > is a big usability improvement and can make the method much more
>>> > practical to use.
>>> In this case, performance becomes better than mere tuning would allow,
>>> since it adapts to database density around the query point.
>>>
>>> Daniel
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Learn Graph Databases - Download FREE O'Reilly Book
>>> "Graph Databases" is the definitive new guide to graph databases and
>>> their
>>> applications. Written by three acclaimed leaders in the field,
>>> this first edition is now available. Download your free book today!
>>> http://p.sf.net/sfu/13534_NeoTech
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>
>>
>>
>> --
>> Undergraduate,
>> Department of Computer Science and Engineering,
>> Faculty of Engineering.
>> University of Moratuwa,
>> Sri Lanka
>>
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and their
>> applications. Written by three acclaimed leaders in the field,
>> this first edition is now available. Download your free book today!
>> http://p.sf.net/sfu/13534_NeoTech
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>


-- 
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka

------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Implementing Locality Sensitivity Hashing to approximate neighbor search

Reply via email to