Hi Vlad!!

Thanks a tonne for the detailed review of my proposal. :)

> Your proposal contains implementation details, but little or no
> discussion of why each change is important and how it impacts users

Yes, I'll add a section discussing the motivation for the various
deliverables (which actually needs to be strengthened a bit).

> weekly blog posts [...] they’re a great opportunity to reach out to the
> community. [...] plug seamlessly with Pandas. We won’t be able to show this
> off in documentation and examples, but it can make for a shiny blog post.

Sure! This is an interesting perspective. Quite frankly, I used to blog
with a dev audience in mind... From now on, I'll make sure each weekly
post showcases an important feature that was recently contributed, and
I'll write it with our user base as the audience!

By "one iteration per metric (as is done currently)" I meant that we
currently refit the grid search once for every single metric that we want
the model to be optimized with respect to... (I think I worded this
poorly; thanks for pointing it out!)
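To make the distinction concrete, here is a toy sketch in plain Python (not scikit-learn's actual API; `fit_and_predict`, `mse`, and `mae` are made-up stand-ins for the expensive fit step and the scorers) contrasting the current pattern of one full search per metric with a single pass that fits each candidate once and scores it with every metric:

```python
def fit_and_predict(params, data):
    # Stand-in for the expensive part: fitting one candidate model.
    return [params["scale"] * x for x in data]

def mse(y_true, y_pred):
    # Mean squared error.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean absolute error.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def evaluate(params, data, targets, metric):
    preds = fit_and_predict(params, data)
    return metric(targets, preds)

grid = [{"scale": 1.0}, {"scale": 2.0}]
data, targets = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

# Current pattern: one full search loop per metric (each candidate is
# refit once for every metric).
per_metric = {
    name: {str(p): evaluate(p, data, targets, m) for p in grid}
    for name, m in [("mse", mse), ("mae", mae)]
}

# Desired pattern: fit each candidate once, score it with every metric.
single_pass = {}
for p in grid:
    preds = fit_and_predict(p, data)  # fitted only once per candidate
    single_pass[str(p)] = {"mse": mse(targets, preds),
                           "mae": mae(targets, preds)}
```

Both dicts end up with the same scores; the difference is only in how many times the fit step runs.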

> How does multiple metric support interfere with model selection APIs?

Refactoring the search/CV objects into model_selection involves splitting
files and moving parts of the code around without a clean git move (git is
blind to such moves), so any merged change that touches the grid search
code would need to be manually rebased, which may be error-prone... This
is why I intend to work on it during the month of April itself and would
love to see it merged ASAP.

> Suddenly there is no more “best_{score|params|estimator}_”. There is an
> API discussion to be had there, and your review of possible options would
> be a great addition to the proposal.

Yes, sure, I'll add a few lines discussing this.
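As a starting point for that discussion, here is a purely illustrative sketch (the `results` layout and `criterion` helper are hypothetical, not a settled scikit-learn API) of two of the options Vlad raises: exposing the best_* values keyed by metric, versus a criterion function that picks a single winner and defaults to the first specified metric:

```python
# Hypothetical per-metric search results, one entry per scoring metric.
results = {
    "accuracy": {"best_score": 0.93, "best_params": {"C": 1.0}},
    "f1":       {"best_score": 0.90, "best_params": {"C": 0.1}},
}

# Option A: best_* attributes become dicts keyed by metric name.
best_params_by_metric = {m: r["best_params"] for m, r in results.items()}

# Option B: a criterion function that reduces the results to one choice,
# defaulting to the first specified metric.
def criterion(results, primary="accuracy"):
    return results[primary]["best_params"]
```

Option A keeps everything available but breaks code that expects a single `best_params_`; Option B preserves backward-compatible behaviour while leaving room for global decisions like the "within one standard error" rule Vlad mentions.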

This also reminds me of the related issues #2733, #1034/#1020, and
#2079/#1842, which all have great ideas that could be used too...

> There is another API discussion about `sample_weight` [...] Wrapping up
> 3+4 I would make sure to reserve time in the timeline for API discussion
> and convergence [...]

Hmm, thanks! I really need to decide how much time can justifiably be
allocated for discussion... It is probably better to pipeline my
deliverables across multiple goals, so that one can be worked on while
another is being reviewed.

> There’s something left over about Nesterov momentum in the timeline.

I should have removed that! Sorry... (It was a leftover from a previous
version of my proposal.)

> Mr. Blondel’s first name is spelled Mathieu

Ah sorry Mathieu ;)

> Are you seriously planning to work 8x7? I thought full time means 8x5.

Yes I am okay with this :) At least I think so :p


Thanks a lot for all your comments! I will address them along with the
other comments and reflect the changes back on the wiki too :)


Have a great day!! :)


R

On Wed, Mar 25, 2015 at 5:09 AM, Vlad Niculae <zephy...@gmail.com> wrote:

> Hi Raghav, hi everyone,
>
> If I may, I have a very high-level comment on your proposal. It clearly
> shows that you are very involved in the project and understand the
> internals well. However, I feel like it’s written from a way too technical
> perspective.  Your proposal contains implementation details, but little or
> no discussion of why each change is important and how it impacts users.
> Taking a step back and writing such discussion can help gain perspective,
> which is important for planning.
>
> This is equally important in terms of your weekly blog posts: they should
> provide an interesting read for more than just scikit-learn developers. So
> try not to think of your GSoC blog posts as a chore/requirement—they’re a
> great opportunity to reach out to the community. Your blog posts will be a
> great opportunity to show off scikit-learn’s ease of use and clean API for
> tasks that can normally get tedious to write manually. It won’t be as easy
> to write about them as it would be if you worked on some shiny new model,
> but if you do it right, this makes it even better: everybody needs
> cross-validation and model selection!
>
> Which leads me to finer comments:
>
> 1. The design of multiple metric support is important and would bring an
> immense usability gain. At the moment, most non-trivial model selection
> cases require custom code.
>
> A while ago there was a mailing list discussion about using Pandas data
> frames for managing the complex multi-dimensional structure that arises. Of
> course, scikit-learn will never have a Pandas dependency, but we can try to
> make it as easy as possible to return things that plug seamlessly with
> Pandas. We won’t be able to show this off in documentation and examples,
> but it can make for a shiny blog post.
>
> 2. Also on multiple metric support, you say “one iteration per metric (as
> is done currently).” What does this refer to, where is it done this way?
>
> 3. How does multiple metric support interfere with model selection APIs?
> Suddenly there is no more “best_{score|params|estimator}_”. There is an API
> discussion to be had there, and your review of possible options would be a
> great addition to the proposal.  For example, will model selection objects
> gain a “criterion” function, that maybe defaults to getting the first
> specified metric? If so, could this API be used to make global decisions,
> e.g. "the model which is within 1 standard error of the best score, but has
> the largest C?” Or should it essentially just return a number per parameter
> configuration, that we then sort by?
>
> 4. There is another API discussion about `sample_weight`: is that the only
> parameter that we want to route to scoring? I have some applications where
> I want some notion of `sample_group`. (This would allow to use scikit-learn
> directly for e.g. query-grouped search results ranking.)  I proposed the
> `sample_*` API convention but it has quite a few downsides; if I remember
> correctly Joel proposed a param_routing API where you would pass a routing
> dict {‘sample_group’: (‘fit’, ‘score’)}: such an API would be much more
> extensible.
>
> Wrapping up 3+4 I would make sure to reserve time in the timeline for API
> discussion and convergence, especially given that we are trying to reach an
> API freeze. This will *not* be easy. It wouldn’t hurt to factor in time for
> PR review as well. This might make you rethink the timeline a bit.
>
> 5. Nitpicks:
> * There are some empty spaces in your proposal: 4, 5 in the abstract, 5, 6
> in the details section, and two weeks in the timeline.
> * updation -> update
> * Mr. Blondel’s first name is spelled Mathieu :)
> * I would try to rephrase point #8 in the detailed section. Reading the
> proposal I had no idea what that point is saying.
> * There’s something left over about Nesterov momentum in the timeline.
> * Are you seriously planning to work 8x7? I thought full time means 8x5.
> * In “About me” you spell Python inconsistently (should be uppercased),
> "no where" -> nowhere, “I, nevertheless” -> “I nevertheless”, september ->
> September.
>
> Hope all my comments can help strengthen your proposal!
>
> Yours,
> Vlad
>
> > On 24 Mar 2015, at 08:40, Joel Nothman <joel.noth...@gmail.com> wrote:
> >
> > I agree with everything Andy says. I think the core developers are very
> enthusiastic to have a project along the lines of "Finish all the things
> that need finishing", but it's very impractical to do so much context
> switching both for students and mentors/reviewers.
> >
> > One of the advantages of GSoC is that it creates specialisation: on the
> one hand, a user becomes expert in what they tackle; on the other,
> reviewers and mentors can limit their attention to the topic at hand. So
> please, try to focus a little more.
> >
> > On 24 March 2015 at 08:40, Andreas Mueller <t3k...@gmail.com> wrote:
> > Hi Raghav.
> >
> > I feel that your proposal lacks some focus.
> > I'd remove the two:
> >
> > Mallow's Cp for LASSO / LARS
> > Implement built in abs max scaler, Nesterov's momentum and finish up the
> Multilayer Perceptron module.
> >
> > And as discussed in this thread probably also
> > Forge a self sufficient ML tutorial based on scikit-learn.
> >
> > If you feel like your proposal has not enough material (not sure about
> that),
> > two things that could be added and are more related to the
> cross-validation and grid-search part
> > (but probably difficult from an API standpoint) are making CV objects
> (aka path algorithms, or generalized cross-validation)
> > work together with GridSearchCV.
> > The other would be how to allow early stopping using a validation set.
> > The two are probably related (imho).
> >
> > Olivier also mentioned cross-validation for out-of-core (partial_fit)
> algorithms.
> > I feel that is not as important, but might also tie into your proposal.
> >
> > Finishing the refactoring of model_evaluation in three days seems a bit
> optimistic, if you include reviews.
> >
> > For sample_weight support, I'm not sure if there are obvious ways to extend
> sample_weight to all the algorithms that you mentioned.
> > How does it work for spectral clustering and agglomerative clustering
> for example?
> >
> > In general, I feel you should rather focus on less things, and more on
> the details of what to do there.
> > Otherwise the proposal looks good.
> > For the wiki, having links to the issues might be helpful.
> >
> > Thanks for the application :)
> >
> > Andy
> >
> > On 03/22/2015 08:52 PM, Raghav R V wrote:
> >> 2 things :
> >>
> >> * The subject should have been "Multiple Metric Support in grid_search
> and cross_validation modules and other general improvements" and not
> multiple metric learning! Sorry for that!
> >> * The link was not available due to the trailing "." (dot), which has
> been fixed now!
> >>
> >> Thanks
> >> R
> >>
> >> On Mon, Mar 23, 2015 at 5:47 AM, Raghav R V <rag...@gmail.com> wrote:
> >> 1. the link is broken
> >>
> >> Ah! Sorry :) -
> https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements
> .
> >>
> >> 2. that sounds quite difficult and unfortunately conducive to cheating
> >>
> >> Hmm... Should I then simply opt for adding more examples then?
> >>
> >>
> >>
> >> On Sun, Mar 22, 2015 at 7:57 PM, Raghav R V <rag...@gmail.com> wrote:
> >> Hi,
> >>
> >> 1. This is my proposal for the multiple metric learning project as a
> wiki page  -
> https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements
> .
> >>
> >> Possible mentors : Andreas Mueller (amueller) and Joel Nothman
> (jnothman)
> >>
> >>   Any feedback/suggestions/additions/deletions would be awesome. :)
> >>
> >> 2. Given that there is a huge interest among students in learning about
> ML, do you think it would be within the scope of/beneficial to skl to have
> all the exercises and/or concepts, from a good quality book (ESL / PRML /
> Murphy) or an academic course like NG's CS229 (not the less rigorous
> coursera version), implemented using sklearn? Or perhaps we could instead
> enhance our tutorials and examples, to be a self study guide to learn about
> ML?
> >> I have included this in my GSoC proposal but was not quite sure if this
> would be an useful idea!!
> >>
> >> Or would it be better if I simply add more examples?
> >>
> >> Please let me know your views!!
> >>
> >> Thanks
> >>
> >>
> >> R
> >>
> >>
> ------------------------------------------------------------------------------
> >> Dive into the World of Parallel Programming The Go Parallel Website,
> sponsored
> >> by Intel and developed in partnership with Slashdot Media, is your hub
> for all
> >> things parallel software development, from weekly thought leadership
> blogs to
> >> news, videos, case studies, tutorials and more. Take a look and join the
> >> conversation now. http://goparallel.sourceforge.net/
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> Scikit-learn-general@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> >
> >
>
>
>
>
