Hi Raghav, hi everyone,

If I may, I have a very high-level comment on your proposal. It clearly shows that you are very involved in the project and understand the internals well. However, I feel it is written from far too technical a perspective: it contains implementation details, but little or no discussion of why each change is important and how it impacts users. Taking a step back to write such a discussion can help you gain perspective, which is important for planning.
This is equally important for your weekly blog posts: they should make an interesting read for more than just scikit-learn developers. So try not to think of your GSoC blog posts as a chore or requirement; they're a great opportunity to reach out to the community. Your blog posts will be a great opportunity to show off scikit-learn's ease of use and clean API for tasks that normally get tedious to write manually. It won't be as easy to write about them as it would be if you were working on some shiny new model, but if you do it right, that makes it even better: everybody needs cross-validation and model selection!

Which leads me to finer comments:

1. The design of multiple metric support is important and would bring an immense usability gain: at the moment, most non-trivial model selection cases require custom code. A while ago there was a mailing list discussion about using Pandas data frames to manage the complex multi-dimensional structure that arises. Of course, scikit-learn will never have a Pandas dependency, but we can try to make it as easy as possible to return things that plug seamlessly into Pandas. We won't be able to show this off in the documentation and examples, but it can make for a shiny blog post.

2. Also on multiple metric support, you say "one iteration per metric (as is done currently)". What does this refer to? Where is it done this way?

3. How does multiple metric support interact with the model selection APIs? Suddenly there is no single `best_{score|params|estimator}_`. There is an API discussion to be had there, and your review of possible options would be a great addition to the proposal. For example, will model selection objects gain a "criterion" function, perhaps defaulting to the first specified metric? If so, could this API be used to make global decisions, e.g. "the model that is within one standard error of the best score but has the largest C"? Or should it essentially just return a number per parameter configuration, which we then sort by?

4. There is another API discussion to be had about `sample_weight`: is that the only parameter we want to route to scoring? I have some applications where I want some notion of `sample_group`. (This would allow using scikit-learn directly for, e.g., query-grouped search results ranking.) I proposed the `sample_*` API convention, but it has quite a few downsides; if I remember correctly, Joel proposed a `param_routing` API where you would pass a routing dict such as `{'sample_group': ['fit', 'score']}`. Such an API would be much more extensible.

Wrapping up points 3 and 4: I would make sure to reserve time in the timeline for API discussion and convergence, especially given that we are trying to reach an API freeze. This will *not* be easy. It wouldn't hurt to factor in time for PR review as well. This might make you rethink the timeline a bit.

5. Nitpicks:

* There are some empty spaces in your proposal: 4 and 5 in the abstract, 5 and 6 in the details section, and two weeks in the timeline.
* updation -> update
* Mr. Blondel's first name is spelled Mathieu :)
* I would try to rephrase point #8 in the detailed section; reading the proposal, I had no idea what that point was saying.
* There is something left over about Nesterov momentum in the timeline.
* Are you seriously planning to work 8x7? I thought full time meant 8x5.
* In "About me" you spell Python inconsistently (it should be capitalized), "no where" -> nowhere, "I, nevertheless" -> "I nevertheless", september -> September.

Hope all my comments help strengthen your proposal!

Yours,
Vlad

> On 24 Mar 2015, at 08:40, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> I agree with everything Andy says.
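(A minimal sketch of the Pandas-friendly results idea from Vlad's point 1 above. Nothing here is existing scikit-learn API: `make_results` and the `mean_test_*` column names are made up for illustration. The point is only that a search object could expose per-configuration results as a plain dict of equal-length columns, which pandas consumes directly via `pd.DataFrame(results)` without scikit-learn ever importing pandas.)

```python
# Hypothetical sketch (not existing scikit-learn API): a multiple-metric
# search could expose its results as a plain dict of equal-length columns,
# one row per parameter configuration. Such a columnar dict plugs directly
# into Pandas via pd.DataFrame(results), with no Pandas dependency needed
# on the scikit-learn side.

def make_results(param_grid, scores_per_metric):
    """Assemble per-fold scores into a columnar results dict."""
    results = {"params": list(param_grid)}
    for metric, per_config_scores in scores_per_metric.items():
        # one mean score per parameter configuration
        results["mean_test_%s" % metric] = [
            sum(folds) / len(folds) for folds in per_config_scores
        ]
    return results

param_grid = [{"C": 0.1}, {"C": 1.0}]
scores = {
    "accuracy": [[0.5, 1.0], [0.75, 1.0]],   # 2 configurations x 2 folds
    "f1":       [[0.5, 0.75], [0.75, 0.75]],
}
results = make_results(param_grid, scores)
print(results["mean_test_accuracy"])  # -> [0.75, 0.875]
```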
> I think the core developers are very enthusiastic to have a project along the lines of "finish all the things that need finishing", but it's very impractical to do so much context switching, both for students and for mentors/reviewers.
>
> One of the advantages of GSoC is that it creates specialisation: on the one hand, a user becomes expert in what they tackle; on the other, reviewers and mentors can limit their attention to the topic at hand. So please, try to focus a little more.
>
> On 24 March 2015 at 08:40, Andreas Mueller <t3k...@gmail.com> wrote:
> Hi Raghav,
>
> I feel that your proposal lacks some focus. I'd remove these two:
>
> Mallow's Cp for LASSO / LARS
> Implement a built-in abs-max scaler and Nesterov's momentum, and finish up the multilayer perceptron module.
>
> And, as discussed in this thread, probably also
>
> Forge a self-sufficient ML tutorial based on scikit-learn.
>
> If you feel like your proposal doesn't have enough material (I'm not sure about that), two things that could be added, and that are more related to the cross-validation and grid search part (but probably difficult from an API standpoint), are making CV objects (aka path algorithms, or generalized cross-validation) work together with GridSearchCV, and allowing early stopping using a validation set. The two are probably related (imho).
>
> Olivier also mentioned cross-validation for out-of-core (partial_fit) algorithms. I feel that is not as important, but it might also tie into your proposal.
>
> Finishing the refactoring of model_evaluation in three days seems a bit optimistic if you include reviews.
>
> For sample_weight support, I'm not sure there are obvious ways to extend sample_weight to all the algorithms you mentioned. How does it work for spectral clustering and agglomerative clustering, for example?
>
> In general, I feel you should focus on fewer things, and more on the details of what to do there.
> Otherwise the proposal looks good. For the wiki, having links to the issues might be helpful.
>
> Thanks for the application :)
>
> Andy
>
> On 03/22/2015 08:52 PM, Raghav R V wrote:
>> Two things:
>>
>> * The subject should have been "Multiple Metric Support in grid_search and cross_validation modules and other general improvements", not multiple metric learning! Sorry for that!
>> * The link was not available due to the trailing "." (dot), which has been fixed now!
>>
>> Thanks,
>> R
>>
>> On Mon, Mar 23, 2015 at 5:47 AM, Raghav R V <rag...@gmail.com> wrote:
>> 1. the link is broken
>>
>> Ah! Sorry :) - https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements.
>>
>> 2. that sounds quite difficult and unfortunately conducive to cheating
>>
>> Hmm... should I simply opt for adding more examples, then?
>>
>> On Sun, Mar 22, 2015 at 7:57 PM, Raghav R V <rag...@gmail.com> wrote:
>> Hi,
>>
>> 1. This is my proposal for the multiple metric learning project, as a wiki page: https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements.
>>
>> Possible mentors: Andreas Mueller (amueller) and Joel Nothman (jnothman)
>>
>> Any feedback/suggestions/additions/deletions would be awesome. :)
>>
>> 2. Given the huge interest among students in learning about ML, do you think it would be within the scope of, or beneficial to, scikit-learn to have all the exercises and/or concepts from a good-quality book (ESL / PRML / Murphy) or an academic course like Ng's CS229 (not the less rigorous Coursera version) implemented using sklearn? Or perhaps we could instead enhance our tutorials and examples to be a self-study guide for learning ML?
>> I have included this in my GSoC proposal but was not quite sure whether it would be a useful idea!
>>
>> Or would it be better if I simply add more examples?
>>
>> Please let me know your views!
>>
>> Thanks,
>> R
>>
>> ------------------------------------------------------------------------------
>> Dive into the World of Parallel Programming The Go Parallel Website, sponsored by Intel and developed in partnership with Slashdot Media, is your hub for all things parallel software development, from weekly thought leadership blogs to news, videos, case studies, tutorials and more. Take a look and join the conversation now. http://goparallel.sourceforge.net/
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
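(A footnote on the early-stopping idea Andy raises in the thread above. The mechanism is easy to state independently of any scikit-learn API; the sketch below is purely illustrative, and every name in it, `fit_with_early_stopping`, `ToyModel`, the `partial_fit`/`validation_score` pairing, and the `patience` parameter, is invented for the example rather than taken from the library.)

```python
# Purely illustrative sketch (no real scikit-learn API): early stopping on a
# held-out validation set. "model" is any object with a partial_fit-style
# incremental update and a validation-set score; "patience" is how many
# epochs without improvement we tolerate before stopping.

def fit_with_early_stopping(model, n_epochs, patience=3):
    best_score, stale_epochs = float("-inf"), 0
    for _ in range(n_epochs):
        model.partial_fit()               # one incremental training pass
        score = model.validation_score()  # evaluate on the validation set
        if score > best_score:
            best_score, stale_epochs = score, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                     # validation score has plateaued
    return best_score

class ToyModel:
    """Toy stand-in whose validation score improves and then plateaus."""
    def __init__(self, trajectory):
        self._trajectory, self._epoch = trajectory, -1
    def partial_fit(self):
        self._epoch += 1
    def validation_score(self):
        return self._trajectory[min(self._epoch, len(self._trajectory) - 1)]

model = ToyModel([0.5, 0.7, 0.8, 0.8, 0.8, 0.8])
best = fit_with_early_stopping(model, n_epochs=20)
print(best)  # -> 0.8 (training stops after 6 epochs, not 20)
```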