I believe a “Data cleaning and preprocessing for data science” (insert-snappier-title-here) tutorial would be a great addition to PyCon. It’s a prerequisite for machine learning, that’s for sure. A machine learning tutorial should probably not sweep it under the carpet completely, but treat it briefly, at least until we have a place or set of resources to point people to. This is still an under-served area.
Best,
Chris

--
Christine (Chris) Waigl - cwa...@alaska.edu - +1-907-474-5483 - Skype: cwaigl_work
Geophysical Institute, UAF, 903 Koyukuk Drive, Fairbanks, AK 99775-7320, USA

> On Sep 30, 2015, at 4:39 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
> I totally agree with Jake. However, I also think that a few general tutorials
> on “preprocessing” of “clean” datasets (clean in terms of missing values,
> duplicates, outliers have been dealt with) could be useful to a broader,
> interdisciplinary audience. For example:
>
> - encoding class labels, encoding nominal vs ordinal feature variables
> - feature scaling and explaining when it matters (convex optimization vs
>   tree-based algos etc.)
> - partial_fit & dimensionality reduction for data compression if data is too
>   large for a typical desktop machine and estimators that don’t support
>   partial_fit; also talking about partial_fit of the dim reduction transformers
>
> These are actually very important topics, and I noticed that they typically
> fall a little bit short in the general ML tutorials; typically, because these
> tutorials work with a single, specific dataset. Unfortunately, I have seen a
> couple of applications where nominal string variables were encoded as
> non-binary integers {1, 2, 3, 4, …}, which may work (i.e., the code executes
> without error) but is not the optimal way to do it.
>
> Best,
> Sebastian
>
>> On Sep 30, 2015, at 7:54 PM, Jacob Vanderplas <jake...@cs.washington.edu>
>> wrote:
>>
>> Hi,
>> The problem with including data munging in the tutorial is that it's not
>> really a machine learning question. Solutions are generally so
>> domain-specific that you can't present it in a way that would be generally
>> useful to an interdisciplinary audience. This is why most (all?) short
>> machine learning tutorials ignore the data cleaning aspect and instead focus
>> on the machine learning algorithms & concepts – and in my tutorials, I
>> always try to emphasize the fact that I'm leaving this part up to the user
>> (and perhaps point them to the pandas tutorial, if one is being offered).
>> Jake
>>
>> Jake VanderPlas
>> Senior Data Science Fellow
>> Director of Research in Physical Sciences
>> University of Washington eScience Institute
>>
>> On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com> wrote:
>> Hello Jake and Andy,
>>
>> If you would not mind some advice, I would suggest including examples (or at
>> least one) where you use data that is not built-in. I remember the first
>> several tutorials (if not all of them) relied completely on built-in data
>> sets and unapologetically ignored the big elephant in the room that people
>> will need to import/read-in their own data and have to deal with it in
>> scikit-learn one way or another, either through pandas or numpy and these
>> will then hand the data over to the appropriate scikit-learn routines.
>>
>> Ignoring coverage of this aspect (and likewise the issue of how to deal with
>> categorical data in data sets), in such tutorials, in my humble opinion
>> presents a somewhat uneasy hurdle to getting started with the scikit-learn
>> tool set. I for one had to use R just to overcome these issues when I first
>> started with this, even though I would have preferred to use Python and its
>> data science stack due to my experience with and preference of Python over R.
>>
>> Best regards
>>
>> On 9/30/2015 8:22 PM, Andy wrote:
>>> Hi Jake.
>>> I think the tutorial Kyle and I did based on the previous tutorials was
>>> working quite well.
>>> I think it would make sense to work of our scipy ones and improve them
>>> further.
>>> I'd be happy to work on it.
>>> We have some more exercises in a branch, and I have also improved versions
>>> of some of the notebooks that I have been using for teaching.
>>>
>>> Andy
>>>
>>> On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
>>>> Hi All,
>>>> PyCon 2016 call for proposals just opened. For the last several years
>>>> Olivier and I have been teaching a two-part scikit-learn tutorial at each
>>>> PyCon, and I think they have gone over well.
>>>>
>>>> As the conference is just a few hour train ride away for me this year, I'm
>>>> certainly going to attend again. I'd also love to put together one or more
>>>> scikit-learn tutorials again this year – if you're planning to attend
>>>> PyCon and would like to work together on a proposal or two, let me know!
>>>> Jake
>>>>
>>>> Jake VanderPlas
>>>> Senior Data Science Fellow
>>>> Director of Research in Physical Sciences
>>>> University of Washington eScience Institute
>>>>
>>>> ------------------------------------------------------------------------------
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
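To illustrate the nominal-versus-ordinal point in Sebastian's quoted message above, here is a minimal sketch in plain Python. The toy rows, the `SIZE_ORDER` mapping, and the helper names `one_hot` and `encode` are all hypothetical and only serve the example; nothing here is taken from the tutorials under discussion. The ordinal feature gets an explicit integer rank, while the nominal feature gets one 0/1 indicator column per category, so no artificial order is introduced.

```python
# Hypothetical toy data: "color" is nominal (no natural order),
# "size" is ordinal (assumed order: M < L < XL).
rows = [
    {"color": "green", "size": "M"},
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "XL"},
]

# Ordinal feature: a hand-written mapping keeps the order meaningful.
SIZE_ORDER = {"M": 1, "L": 2, "XL": 3}


def one_hot(value, categories):
    """Return a 0/1 indicator list with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


def encode(rows):
    """Encode each row as [size_rank, one indicator column per color]."""
    # Sort the distinct colors so the indicator columns have a fixed order.
    colors = sorted({row["color"] for row in rows})
    encoded = [
        [SIZE_ORDER[row["size"]]] + one_hot(row["color"], colors)
        for row in rows
    ]
    return colors, encoded


colors, encoded = encode(rows)
for encoded_row in encoded:
    print(encoded_row)
```

Encoding `color` as 1, 2, 3 instead would run without error, but it would tell a distance-based or linear model that one color is "between" the other two. In practice, `pandas.get_dummies` or `sklearn.preprocessing.OneHotEncoder` would typically do the indicator step.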