Hi Raghav, hi everyone,

If I may, I have a very high-level comment on your proposal. It clearly shows that you are very involved in the project and understand the internals well. However, I feel it is written from far too technical a perspective: it contains implementation details, but little or no discussion of why each change is important and how it impacts users. Taking a step back to write such a discussion can help you gain perspective, which is important for planning.
This is equally important for your weekly blog posts: they should make an interesting read for more than just scikit-learn developers. So try not to think of your GSoC blog posts as a chore or requirement; they're a great opportunity to reach out to the community. Your blog posts will be a great opportunity to show off scikit-learn's ease of use and clean API for tasks that normally get tedious to write manually. It won't be as easy to write about them as it would be if you were working on some shiny new model, but if you do it right, that makes it even better: everybody needs cross-validation and model selection!

Which leads me to finer comments:

1. The design of multiple metric support is important and would bring an immense usability gain: at the moment, most non-trivial model selection cases require custom code. A while ago there was a mailing list discussion about using Pandas data frames to manage the complex multi-dimensional structure that arises. Of course, scikit-learn will never have a Pandas dependency, but we can try to make it as easy as possible to return things that plug seamlessly into Pandas. We won't be able to show this off in the documentation and examples, but it can make for a shiny blog post.

2. Also on multiple metric support, you say "one iteration per metric (as is done currently)". What does this refer to? Where is it done this way?

3. How does multiple metric support interact with the model selection APIs? Suddenly there is no single `best_{score|params|estimator}_`. There is an API discussion to be had there, and your review of possible options would be a great addition to the proposal. For example, will model selection objects gain a "criterion" function, perhaps defaulting to the first specified metric? If so, could this API be used to make global decisions, e.g. "the model that is within one standard error of the best score but has the largest C"? Or should it essentially just return a number per parameter configuration, which we then sort by?

4. There is another API discussion to be had about `sample_weight`: is that the only parameter we want to route to scoring? I have some applications where I want some notion of `sample_group`. (This would allow using scikit-learn directly for, e.g., query-grouped search results ranking.) I proposed the `sample_*` API convention, but it has quite a few downsides; if I remember correctly, Joel proposed a `param_routing` API where you would pass a routing dict such as `{'sample_group': ['fit', 'score']}`. Such an API would be much more extensible.

Wrapping up points 3 and 4: I would make sure to reserve time in the timeline for API discussion and convergence, especially given that we are trying to reach an API freeze. This will *not* be easy. It wouldn't hurt to factor in time for PR review as well. This might make you rethink the timeline a bit.

5. Nitpicks:

* There are some empty spaces in your proposal: 4 and 5 in the abstract, 5 and 6 in the details section, and two weeks in the timeline.
* updation -> update
* Mr. Blondel's first name is spelled Mathieu :)
* I would try to rephrase point #8 in the detailed section; reading the proposal, I had no idea what that point was saying.
* There is something left over about Nesterov momentum in the timeline.
* Are you seriously planning to work 8x7? I thought full time meant 8x5.
* In "About me" you spell Python inconsistently (it should be capitalized), "no where" -> nowhere, "I, nevertheless" -> "I nevertheless", september -> September.

Hope all my comments help strengthen your proposal!

Yours,
Vlad

> On 24 Mar 2015, at 08:40, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> I agree with everything Andy says.
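(A minimal sketch of the Pandas-friendly results idea from Vlad's point 1 above. Nothing here is existing scikit-learn API: `make_results` and the `mean_test_*` column names are made up for illustration. The point is only that a search object could expose per-configuration results as a plain dict of equal-length columns, which pandas consumes directly via `pd.DataFrame(results)` without scikit-learn ever importing pandas.)

```python
# Hypothetical sketch (not existing scikit-learn API): a multiple-metric
# search could expose its results as a plain dict of equal-length columns,
# one row per parameter configuration. Such a columnar dict plugs directly
# into Pandas via pd.DataFrame(results), with no Pandas dependency needed
# on the scikit-learn side.

def make_results(param_grid, scores_per_metric):
    """Assemble per-fold scores into a columnar results dict."""
    results = {"params": list(param_grid)}
    for metric, per_config_scores in scores_per_metric.items():
        # one mean score per parameter configuration
        results["mean_test_%s" % metric] = [
            sum(folds) / len(folds) for folds in per_config_scores
        ]
    return results

param_grid = [{"C": 0.1}, {"C": 1.0}]
scores = {
    "accuracy": [[0.5, 1.0], [0.75, 1.0]],   # 2 configurations x 2 folds
    "f1":       [[0.5, 0.75], [0.75, 0.75]],
}
results = make_results(param_grid, scores)
print(results["mean_test_accuracy"])  # -> [0.75, 0.875]
```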
> I think the core developers are very enthusiastic to have a project along the lines of "finish all the things that need finishing", but it's very impractical to do so much context switching, both for students and for mentors/reviewers.
>
> One of the advantages of GSoC is that it creates specialisation: on the one hand, a user becomes expert in what they tackle; on the other, reviewers and mentors can limit their attention to the topic at hand. So please, try to focus a little more.
>
> On 24 March 2015 at 08:40, Andreas Mueller <t3k...@gmail.com> wrote:
> Hi Raghav,
>
> I feel that your proposal lacks some focus. I'd remove these two:
>
> Mallow's Cp for LASSO / LARS
> Implement a built-in abs-max scaler and Nesterov's momentum, and finish up the multilayer perceptron module.
>
> And, as discussed in this thread, probably also
>
> Forge a self-sufficient ML tutorial based on scikit-learn.
>
> If you feel like your proposal doesn't have enough material (I'm not sure about that), two things that could be added, and that are more related to the cross-validation and grid search part (but probably difficult from an API standpoint), are making CV objects (aka path algorithms, or generalized cross-validation) work together with GridSearchCV, and allowing early stopping using a validation set. The two are probably related (imho).
>
> Olivier also mentioned cross-validation for out-of-core (partial_fit) algorithms. I feel that is not as important, but it might also tie into your proposal.
>
> Finishing the refactoring of model_evaluation in three days seems a bit optimistic if you include reviews.
>
> For sample_weight support, I'm not sure there are obvious ways to extend sample_weight to all the algorithms you mentioned. How does it work for spectral clustering and agglomerative clustering, for example?
>
> In general, I feel you should focus on fewer things, and more on the details of what to do there.
> Otherwise the proposal looks good. For the wiki, having links to the issues might be helpful.
>
> Thanks for the application :)
>
> Andy
>
> On 03/22/2015 08:52 PM, Raghav R V wrote:
>> Two things:
>>
>> * The subject should have been "Multiple Metric Support in grid_search and cross_validation modules and other general improvements", not multiple metric learning! Sorry for that!
>> * The link was not available due to the trailing "." (dot), which has been fixed now!
>>
>> Thanks,
>> R
>>
>> On Mon, Mar 23, 2015 at 5:47 AM, Raghav R V <rag...@gmail.com> wrote:
>> 1. the link is broken
>>
>> Ah! Sorry :) - https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements.
>>
>> 2. that sounds quite difficult and unfortunately conducive to cheating
>>
>> Hmm... should I simply opt for adding more examples, then?
>>
>> On Sun, Mar 22, 2015 at 7:57 PM, Raghav R V <rag...@gmail.com> wrote:
>> Hi,
>>
>> 1. This is my proposal for the multiple metric learning project, as a wiki page: https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements.
>>
>> Possible mentors: Andreas Mueller (amueller) and Joel Nothman (jnothman)
>>
>> Any feedback/suggestions/additions/deletions would be awesome. :)
>>
>> 2. Given the huge interest among students in learning about ML, do you think it would be within the scope of, or beneficial to, scikit-learn to have all the exercises and/or concepts from a good-quality book (ESL / PRML / Murphy) or an academic course like Ng's CS229 (not the less rigorous Coursera version) implemented using sklearn? Or perhaps we could instead enhance our tutorials and examples to be a self-study guide for learning ML?
>> I have included this in my GSoC proposal but was not quite sure whether it would be a useful idea!
>>
>> Or would it be better if I simply add more examples?
>>
>> Please let me know your views!
>>
>> Thanks,
>> R
>>
>> ------------------------------------------------------------------------------
>> Dive into the World of Parallel Programming The Go Parallel Website, sponsored by Intel and developed in partnership with Slashdot Media, is your hub for all things parallel software development, from weekly thought leadership blogs to news, videos, case studies, tutorials and more. Take a look and join the conversation now. http://goparallel.sourceforge.net/
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
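(A footnote on the early-stopping idea Andy raises in the thread above. The mechanism is easy to state independently of any scikit-learn API; the sketch below is purely illustrative, and every name in it, `fit_with_early_stopping`, `ToyModel`, the `partial_fit`/`validation_score` pairing, and the `patience` parameter, is invented for the example rather than taken from the library.)

```python
# Purely illustrative sketch (no real scikit-learn API): early stopping on a
# held-out validation set. "model" is any object with a partial_fit-style
# incremental update and a validation-set score; "patience" is how many
# epochs without improvement we tolerate before stopping.

def fit_with_early_stopping(model, n_epochs, patience=3):
    best_score, stale_epochs = float("-inf"), 0
    for _ in range(n_epochs):
        model.partial_fit()               # one incremental training pass
        score = model.validation_score()  # evaluate on the validation set
        if score > best_score:
            best_score, stale_epochs = score, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                     # validation score has plateaued
    return best_score

class ToyModel:
    """Toy stand-in whose validation score improves and then plateaus."""
    def __init__(self, trajectory):
        self._trajectory, self._epoch = trajectory, -1
    def partial_fit(self):
        self._epoch += 1
    def validation_score(self):
        return self._trajectory[min(self._epoch, len(self._trajectory) - 1)]

model = ToyModel([0.5, 0.7, 0.8, 0.8, 0.8, 0.8])
best = fit_with_early_stopping(model, n_epochs=20)
print(best)  # -> 0.8 (training stops after 6 epochs, not 20)
```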