I totally agree with Jake. However, I also think that a few general tutorials on “preprocessing” of “clean” datasets (clean in the sense that missing values, duplicates, and outliers have already been dealt with) could be useful to a broader, interdisciplinary audience. For example:
- encoding class labels, encoding nominal vs ordinal feature variables
- feature scaling and explaining when it matters (convex optimization vs tree-based algos etc.)
- partial_fit & dimensionality reduction for data compression when the data is too large for a typical desktop machine and the estimator doesn’t support partial_fit; also talking about partial_fit of the dimensionality reduction transformers

These are actually very important topics, and I have noticed that they typically fall a little bit short in general ML tutorials, usually because these tutorials work with a single, specific dataset. Unfortunately, I have seen a couple of applications where nominal string variables were encoded as non-binary integers {1, 2, 3, 4, …}, which may work (i.e., the code executes without error) but is not the optimal way to do it, since it implies an order among categories that have none.

Best,
Sebastian

> On Sep 30, 2015, at 7:54 PM, Jacob Vanderplas <jake...@cs.washington.edu>
> wrote:
>
> Hi,
> The problem with including data munging in the tutorial is that it's not
> really a machine learning question. Solutions are generally so
> domain-specific that you can't present it in a way that would be generally
> useful to an interdisciplinary audience. This is why most (all?) short
> machine learning tutorials ignore the data cleaning aspect and instead focus
> on the machine learning algorithms & concepts – and in my tutorials, I always
> try to emphasize the fact that I'm leaving this part up to the user (and
> perhaps point them to the pandas tutorial, if one is being offered).
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Research in Physical Sciences
> University of Washington eScience Institute
>
> On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com> wrote:
> Hello Jake and Andy,
>
> If you would not mind some advice, I would suggest including examples (or at
> least one) where you use data that is not built-in.
> I remember the first
> several tutorials (if not all of them) relied completely on built-in data
> sets and unapologetically ignored the big elephant in the room that people
> will need to import/read-in their own data and have to deal with it in
> scikit-learn one way or another, either through pandas or numpy and these
> will then hand the data over to the appropriate scikit-learn routines.
>
> Ignoring coverage of this aspect (and likewise the issue of how to deal with
> categorical data in data sets), in such tutorials, in my humble opinion
> presents a somewhat uneasy hurdle to getting started with the scikit-learn
> tool set. I for one had to use R just to overcome these issues when I first
> started with this, even though I would have preferred to use Python and its
> data science stack due to my experience with and preference for Python over R.
>
> Best regards
>
> On 9/30/2015 8:22 PM, Andy wrote:
>> Hi Jake.
>> I think the tutorial Kyle and I did based on the previous tutorials was
>> working quite well.
>> I think it would make sense to work off our SciPy ones and improve them
>> further.
>> I'd be happy to work on it.
>> We have some more exercises in a branch, and I have also improved versions
>> of some of the notebooks that I have been using for teaching.
>>
>> Andy
>>
>> On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
>>> Hi All,
>>> PyCon 2016 call for proposals just opened. For the last several years
>>> Olivier and I have been teaching a two-part scikit-learn tutorial at each
>>> PyCon, and I think they have gone over well.
>>>
>>> As the conference is just a few hours' train ride away for me this year, I'm
>>> certainly going to attend again. I'd also love to put together one or more
>>> scikit-learn tutorials again this year – if you're planning to attend PyCon
>>> and would like to work together on a proposal or two, let me know!
>>> Jake
>>>
>>> Jake VanderPlas
>>> Senior Data Science Fellow
>>> Director of Research in Physical Sciences
>>> University of Washington eScience Institute
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
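
P.S. To make the encoding point from the top of this message a bit more concrete, here is a minimal sketch of the difference between encoding an ordinal and a nominal feature. The toy values, the assumed size ordering, and all variable names are made up for illustration and are not from this thread. It is plain Python so that it runs without dependencies; in practice one would more likely reach for something like pandas.get_dummies or scikit-learn's OneHotEncoder.

```python
# Hypothetical toy data: the values and the size ordering below are
# assumptions made purely for illustration.
sizes = ["M", "L", "XL", "M"]              # ordinal: M < L < XL
colors = ["green", "red", "blue", "red"]   # nominal: no natural order

# Ordinal feature: an explicit mapping preserves the known order.
size_order = {"M": 1, "L": 2, "XL": 3}
sizes_encoded = [size_order[s] for s in sizes]

# Nominal feature: one binary column per category (one-hot encoding),
# so no artificial order such as blue < green < red is implied.
categories = sorted(set(colors))
colors_onehot = [[int(c == cat) for cat in categories] for c in colors]

print(sizes_encoded)
print(categories)
print(colors_onehot)
```

Mapping the colors to {1, 2, 3} instead would also execute without error, but an estimator that treats the column as numeric would then see an ordering and spacing between colors that does not exist.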