>
> My thinking was to implement "Basic LSH with basic data structures" and
> then spend some of the time working on seeing if moderate improvements
> (i.e. a more complex data structure) can deliver benefits.
In terms of data structure complexity, the standard method uses multiple
hash tables (which can be numpy 2d arrays), while the forest uses binary
trees (which can also be implemented with numpy arrays). The only extra
operation required when indexing is a binary search, and the benefits of
the forest method can easily compensate for this minor complexity.
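To make the point about array-backed trees concrete, here is a minimal sketch of one "tree" stored as a sorted numpy array and queried by binary search on a hash prefix. It assumes random hyperplane hashing; the names `build_tree` and `query_tree` are hypothetical, and this is an illustration rather than the proposed implementation.

```python
import numpy as np


def build_tree(X, n_bits, rng):
    """Index X as one 'tree': integer hash codes kept in a sorted 1d array."""
    planes = rng.randn(X.shape[1], n_bits)  # random hyperplanes (assumed hash family)
    # Bit weights, most significant bit first, so a shared prefix means a
    # contiguous range in sorted order.
    weights = np.int64(1) << np.arange(n_bits - 1, -1, -1, dtype=np.int64)
    bits = (np.dot(X, planes) > 0).astype(np.int64)
    codes = np.dot(bits, weights)
    order = np.argsort(codes, kind="mergesort")
    return {"planes": planes, "weights": weights, "n_bits": n_bits,
            "codes": codes[order], "order": order}


def query_tree(tree, x, prefix_len):
    """Return indices of points sharing the first prefix_len hash bits with x."""
    bits = (np.dot(x, tree["planes"]) > 0).astype(np.int64)
    code = int(np.dot(bits, tree["weights"]))
    shift = tree["n_bits"] - prefix_len
    lo = (code >> shift) << shift   # smallest code with this prefix
    hi = lo + (1 << shift)          # one past the largest code with this prefix
    # Two binary searches replace an explicit tree traversal.
    left = np.searchsorted(tree["codes"], lo, side="left")
    right = np.searchsorted(tree["codes"], hi, side="left")
    return tree["order"][left:right]


rng = np.random.RandomState(0)
X = rng.randn(200, 8)
tree = build_tree(X, n_bits=16, rng=rng)
candidates = query_tree(tree, X[5], prefix_len=8)
```

Shortening `prefix_len` widens the candidate range without rebuilding anything, which is the flexibility the forest adds over fixed-length hash tables.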
> Yes. But ideally there would be other applications of LSH than KNN. That
> would contribute to the 'benefit' part of the cost/benefit analysis.
Currently, the scope of this project is limited to approximate neighbor
search using LSH, but the range of applications of LSH is much wider. It
can even be implemented with a hash kernel (the hashing trick in
sklearn.feature_extraction). Further, it can be used for hierarchical
clustering as well. But since the goal of this project is mainly approximate
neighbor search, shouldn't the evaluation criteria concentrate more on that?
> To decide I think the best way to proceed would be to have an
> evaluation / prototyping stage in the project that would first hack an
> implementation of the basic methods (at least random projections
> and maybe others) in a gist outside of the scikit-learn codebase, not
> to worry about documentation and compliance with the API and benchmark
> it / profile it on representative datasets with various statistical
> structures and compare the outcome to alternative methods such as the
> methods implemented in FLANN and Annoy.
Yes. This can be done as a preliminary step of the project to find the best
implementation method, so I had better include that part in my proposal as well.
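As a rough idea of what such a prototype benchmark could look like, the sketch below compares a single random-projection hash table against brute-force search and reports recall and timings. The helper names (`exact_knn`, `lsh_knn`, `evaluate`) and the synthetic Gaussian data are assumptions for illustration only; a real evaluation would use representative datasets and the alternative libraries mentioned above.

```python
import time
import numpy as np


def exact_knn(X, q, k):
    """Brute-force k nearest neighbors by squared Euclidean distance."""
    dist = np.sum((X - q) ** 2, axis=1)
    return np.argsort(dist)[:k]


def lsh_knn(X, codes, planes, q, k):
    """Rank only the points whose hash code equals the query's code."""
    q_code = np.dot(q, planes) > 0
    candidates = np.flatnonzero(np.all(codes == q_code, axis=1))
    dist = np.sum((X[candidates] - q) ** 2, axis=1)
    return candidates[np.argsort(dist)[:k]]


def evaluate(X, queries, k, n_bits, seed=0):
    """Return (recall, seconds for exact search, seconds for LSH search)."""
    rng = np.random.RandomState(seed)
    planes = rng.randn(X.shape[1], n_bits)
    codes = np.dot(X, planes) > 0

    start = time.time()
    truths = [exact_knn(X, q, k) for q in queries]
    t_exact = time.time() - start

    start = time.time()
    approxs = [lsh_knn(X, codes, planes, q, k) for q in queries]
    t_lsh = time.time() - start

    hits = sum(len(np.intersect1d(t, a)) for t, a in zip(truths, approxs))
    return hits / float(k * len(queries)), t_exact, t_lsh


rng = np.random.RandomState(42)
X = rng.randn(500, 16)
queries = rng.randn(20, 16)
recall, t_exact, t_lsh = evaluate(X, queries, k=5, n_bits=4)
```

Varying `n_bits` (and later the number of tables or trees) against recall and query time would give the trade-off curves needed to pick an implementation.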
Maheshakya.
On Fri, Mar 14, 2014 at 3:45 AM, Robert Layton <[email protected]> wrote:
> Thanks Gael. My thinking was to implement "Basic LSH with basic data
> structures" and then spend some of the time working on seeing if moderate
> improvements (i.e. a more complex data structure) can deliver benefits.
> This way, we get the key deliverable, and spend some time trying to see if
> we can do better.
>
> I'd also like to see scalability added to the evaluation criteria!
>
>
> On 14 March 2014 03:59, Gael Varoquaux <[email protected]> wrote:
>
>> On Thu, Mar 13, 2014 at 09:57:27AM -0700, Gael Varoquaux wrote:
>> > On Thu, Mar 13, 2014 at 10:12:04AM +0200, Daniel Vainsencher wrote:
>> > > Other data structures worth evaluating (without depending on them) are
>> > > BTrees
>>
>> > Let's really avoid this for now. It's not like we have too many people
>> > who understand the codebase and are available to maintain it.
>>
>> To stress things again: we often see projects, whether they are GSOC or
>> elsewhere, that fail to deliver, because people implement plenty of fancy
>> things, and do not focus on the quality and maintainability. At the end
>> of the day, these projects leave _nothing_ for scikit-learn.
>>
>> There is no point shooting first for the moon. Remember the 80/20
>> tradeoff rules: 80% of the benefits (in terms of use cases, speed, ...) can
>> be achieved with 20% of the efforts.
>>
>> Gaël
>>
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and their
>> applications. Written by three acclaimed leaders in the field,
>> this first edition is now available. Download your free book today!
>> http://p.sf.net/sfu/13534_NeoTech
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
>
>
>
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka