Hi all,

Interesting discussion – thanks for the thoughts and ideas! I like the idea of a Data Munging tutorial being separate from an ML tutorial. With only 3 hours, that seems more doable than trying to squeeze it all into one session. Perhaps Stephan's idea of an image-focused ML tutorial could be a kind of case study in domain-specific data munging.
I'd echo Kyle's comments on time being a huge constraint. I've given some version of my basic scikit-learn tutorial probably two dozen times now, and I find that 3 hours is basically enough to cover the data format (i.e. n_samples x n_features), the estimator API, the basics of supervised vs unsupervised learning, and a brief deeper dive into two or three estimators (I usually cover SVMs, Random Forests, and K-Means). That might not seem like a lot of material, but for an intro audience who asks a lot of questions, it can easily fill the time.

This weekend Stephan, Andy, and I will be at a meeting together near Seattle. Perhaps the three of us can chat a bit in person and see how we might tap all of this enthusiasm & incorporate the ideas presented here. I think they will go over well: having been on the PyCon tutorial committee for the past several years, I know that there has been demand for more data-science-related tutorial topics. I can also tell you that if we present a "united front", so to speak, with some thought put into the flow between multiple related tutorials, we'll have a better chance of getting them accepted (though keep in mind that if submissions are similar to the last few years, only about 1 in 4 proposals will be accepted overall!).

Thanks,
   Jake

Jake VanderPlas
Senior Data Science Fellow
Director of Research in Physical Sciences
University of Washington eScience Institute

On Wed, Sep 30, 2015 at 7:06 PM, Kyle Kastner <kastnerk...@gmail.com> wrote:

> If people are planning to work on this, it would be good to check what
> Andy and I presented at SciPy, which is based on what Jake and Olivier
> did at PyCon (and what Andy, Jake and Gael did at SciPy 2013, etc.
> etc.).
>
> To Sebastian's points - we covered all of these nearly verbatim except
> perhaps class imbalance (maybe in the spam example? Don't recall
> explicitly covering this, though it was requested in the course
> feedback).
> We also covered "out of sklearn" data fairly extensively, loading from
> CSV and preprocessing several datasets. See specifically the case
> studies here:
>
> https://github.com/amueller/scipy_2015_sklearn_tutorial/tree/master/notebooks
>
> An issue is that really big dataset downloads tend to melt conference
> wifi, and even sending many reminder emails to clone and download all
> the data will only slightly reduce the number of all-at-once
> downloads, so there is a balance between "interesting" and "large" at
> play. Personally, I think it would be nice to show an image example
> using skimage, and maybe something more esoteric, but that might be an
> issue since the tutorial is already quite crowded.
>
> One of the key things I noticed is that *time* is a huge issue - even at
> SciPy, which has a slightly more technical base level than PyCon, we
> ran out of time to cover these topics to the appropriate depth (with
> two 4hr sessions!). Covering things sufficiently well for introductory
> students while also providing enough tips for people doing this in
> practice is hard, and the breadth of experience at PyCon will probably
> be even more difficult to cover than at SciPy. Something to be aware of,
> at least.
>
> Andy has also been working on some presentations/courses for the book
> which might be useful here, though I don't know what state they are in
> at the current moment.
>
> As Andy said, we have some solutions in a branch and I would be glad
> to help get this set up. I don't know whether I will be there or not
> just yet, but I hope to attend and could maybe teach if more hands
> are needed and my schedule works out.
>
> On Wed, Sep 30, 2015 at 8:58 PM, Chris Waigl <cwa...@alaska.edu> wrote:
> > I believe a “Data cleaning and preprocessing for data science”
> > (insert-snappier-title-here) tutorial would be a great addition to a
> > PyCon. It’s a prerequisite for machine learning, that’s sure.
> > A machine learning tutorial should probably not completely sweep it
> > under the carpet, but treat it in the briefest way, at least until we
> > have a place / set of resources to point people to. This is still an
> > under-served area.
> >
> > Best,
> >
> > Chris
> >
> > --
> > Christine (Chris) Waigl - cwa...@alaska.edu - +1-907-474-5483 - Skype: cwaigl_work
> > Geophysical Institute, UAF, 903 Koyukuk Drive, Fairbanks, AK 99775-7320, USA
> >
> > On Sep 30, 2015, at 4:39 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> >
> > I totally agree with Jake. However, I also think that a few general
> > tutorials on “preprocessing” of “clean” datasets (clean in the sense
> > that missing values, duplicates, and outliers have been dealt with)
> > could be useful to a broader, interdisciplinary audience. For example:
> >
> > - encoding class labels, encoding nominal vs ordinal feature variables
> > - feature scaling and explaining when it matters (convex optimization
> >   vs tree-based algos etc.)
> > - partial_fit & dimensionality reduction for data compression if data
> >   is too large for a typical desktop machine and estimators that don’t
> >   support partial_fit; also talking about partial_fit of the dim
> >   reduction transformers
> >
> > These are actually very important topics, and I noticed that they
> > typically fall a little bit short in the general ML tutorials,
> > typically because these tutorials work with a single, specific
> > dataset. Unfortunately, I have seen a couple of applications where
> > nominal string variables were encoded as non-binary integers
> > {1, 2, 3, 4, …}, which may work (i.e., the code executes without
> > error) but is not the optimal way to do it.
> >
> > Best,
> > Sebastian
> >
> > On Sep 30, 2015, at 7:54 PM, Jacob Vanderplas <jake...@cs.washington.edu> wrote:
> >
> > Hi,
> > The problem with including data munging in the tutorial is that it's
> > not really a machine learning question.
> > Solutions are generally so domain-specific that you can't present
> > them in a way that would be generally useful to an interdisciplinary
> > audience. This is why most (all?) short machine learning tutorials
> > ignore the data cleaning aspect and instead focus on the machine
> > learning algorithms & concepts – and in my tutorials, I always try to
> > emphasize the fact that I'm leaving this part up to the user (and
> > perhaps point them to the pandas tutorial, if one is being offered).
> >    Jake
> >
> > Jake VanderPlas
> > Senior Data Science Fellow
> > Director of Research in Physical Sciences
> > University of Washington eScience Institute
> >
> > On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com> wrote:
> >
> > Hello Jake and Andy,
> >
> > If you would not mind some advice, I would suggest including examples
> > (or at least one) where you use data that is not built-in. I remember
> > the first several tutorials (if not all of them) relied completely on
> > built-in data sets and unapologetically ignored the big elephant in
> > the room: people will need to import/read in their own data and have
> > to deal with it in scikit-learn one way or another, either through
> > pandas or numpy, which will then hand the data over to the
> > appropriate scikit-learn routines.
> >
> > Ignoring coverage of this aspect (and likewise the issue of how to
> > deal with categorical data in data sets) in such tutorials, in my
> > humble opinion, presents a somewhat uneasy hurdle to getting started
> > with the scikit-learn tool set. I for one had to use R just to
> > overcome these issues when I first started with this, even though I
> > would have preferred to use Python and its data science stack due to
> > my experience with and preference for Python over R.
> >
> > Best regards
> >
> > On 9/30/2015 8:22 PM, Andy wrote:
> >
> > Hi Jake.
> > I think the tutorial Kyle and I did based on the previous tutorials
> > was working quite well.
> > I think it would make sense to work off our SciPy ones and improve
> > them further.
> > I'd be happy to work on it.
> > We have some more exercises in a branch, and I also have improved
> > versions of some of the notebooks that I have been using for teaching.
> >
> > Andy
> >
> > On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
> >
> > Hi All,
> > PyCon 2016 call for proposals just opened. For the last several years
> > Olivier and I have been teaching a two-part scikit-learn tutorial at
> > each PyCon, and I think they have gone over well.
> >
> > As the conference is just a few hours' train ride away for me this
> > year, I'm certainly going to attend again. I'd also love to put
> > together one or more scikit-learn tutorials again this year – if
> > you're planning to attend PyCon and would like to work together on a
> > proposal or two, let me know!
> >    Jake
> >
> > Jake VanderPlas
> > Senior Data Science Fellow
> > Director of Research in Physical Sciences
> > University of Washington eScience Institute
> >
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
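A minimal sketch of the encoding point Sebastian raises in the thread above (nominal vs ordinal feature variables). It is not taken from any of the tutorials discussed: the toy data, the `size_order` mapping, and the `one_hot` helper are all hypothetical, and it is written in plain Python only so that it runs without extra dependencies.

```python
# Hypothetical toy data; names and values are illustrative only.
sizes = ["M", "L", "S", "M"]                 # ordinal: S < M < L
colors = ["red", "green", "blue", "green"]   # nominal: no inherent order

# Ordinal feature: an explicit mapping keeps the real ordering,
# so integer codes are meaningful here.
size_order = {"S": 0, "M": 1, "L": 2}
size_encoded = [size_order[s] for s in sizes]


def one_hot(values):
    """Return (categories, rows) with one binary column per category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Nominal feature: one binary column per category, so no artificial
# ordering such as red < green < blue is implied.
color_categories, color_encoded = one_hot(colors)

print(size_encoded)
print(color_categories)
print(color_encoded)
```

In practice one would more likely reach for `pandas.get_dummies` or scikit-learn's `OneHotEncoder` rather than a hand-written helper; the only point here is that the nominal column becomes binary indicator columns, while integer codes are reserved for the feature that really has an order.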