I agree that data munging is not, strictly speaking, a machine learning question from the mathematical or computational point of view. But there is no denying that most of the time spent doing machine learning actually goes into data munging. So dealing with data surely has something to do with machine learning, even if not algorithmically. After all, if people only know how to work with very clean built-in data sets, and not how to import even one external data set, how can they be expected to appreciate the tool set or use it in real-life situations?
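To make concrete what I mean by importing external data, here is a minimal sketch in plain Python, not anything taken from the existing tutorials. The toy CSV text, the column names and the helper functions `load_rows` and `one_hot_encode` are all made up for illustration. It reads CSV data and expands a categorical column into 0/1 indicator columns, so that everything ends up numeric. In practice `pandas.read_csv` together with `pandas.get_dummies`, or scikit-learn's `OneHotEncoder`, would do these steps.

```python
import csv
import io

# Hypothetical toy data standing in for an external CSV file.
RAW_CSV = """age,city,bought
25,Paris,1
40,Cairo,0
31,Paris,1
58,Lima,0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def one_hot_encode(rows, numeric_cols, categorical_cols):
    """Return (feature_names, X), where X is a list of numeric rows.

    Each categorical column is expanded into one 0/1 column per
    distinct value, with the values taken in sorted order.
    """
    levels = {c: sorted({r[c] for r in rows}) for c in categorical_cols}

    names = list(numeric_cols)
    for c in categorical_cols:
        names += ["%s=%s" % (c, v) for v in levels[c]]

    X = []
    for r in rows:
        features = [float(r[c]) for c in numeric_cols]
        for c in categorical_cols:
            features += [1.0 if r[c] == v else 0.0 for v in levels[c]]
        X.append(features)
    return names, X


rows = load_rows(RAW_CSV)
feature_names, X = one_hot_encode(rows, ["age"], ["city"])
y = [int(r["bought"]) for r in rows]  # target column kept separate
```

The resulting `X` (one numeric row per sample) and `y` (one label per sample) have the layout that scikit-learn estimators take in `fit(X, y)`; which estimator to use is left open here.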
I don't think pointing people to the pandas manual is really enough. It was not enough for me, and I already knew pandas. The reason is the particular way scikit-learn requires data to be presented to its objects. Last time I checked (I really don't know whether this has changed since), one had to do some wrangling with pandas data frames, however subtle, to get scikit-learn to understand them. It also took quite an effort to work out how to encode categorical factors and represent them in a form that scikit-learn understands.

Of course it is your call what to include and what to leave out. I do think, however, that it would be great if the tutorial included at least one simple, straightforward example of dealing with external data (some of it categorical). That would surely be much appreciated by all, especially by those interested in the tutorials you might be presenting.

Best regards

On 9/30/2015 11:54 PM, Jacob Vanderplas wrote:
> Hi,
> The problem with including data munging in the tutorial is that it's
> not really a machine learning question. Solutions are generally so
> domain-specific that you can't present it in a way that would be
> generally useful to an interdisciplinary audience. This is why most
> (all?) short machine learning tutorials ignore the data cleaning
> aspect and instead focus on the machine learning algorithms & concepts
> – and in my tutorials, I always try to emphasize the fact that I'm
> leaving this part up to the user (and perhaps point them to the pandas
> tutorial, if one is being offered).
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Research in Physical Sciences
> University of Washington eScience Institute
>
> On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com
> <mailto:kha...@yahoo.com>> wrote:
>
> Hello Jake and Andy,
>
> If you would not mind some advice, I would suggest including
> examples (or at least one) where you use data that is not
> built-in. I remember the first several tutorials (if not all of
> them) relied completely on built-in data sets and unapologetically
> ignored the big elephant in the room that people will need to
> import/read-in their own data and have to deal with it in
> scikit-learn one way or another, either through pandas or numpy
> and these will then hand the data over to the appropriate
> scikit-learn routines.
>
> Ignoring coverage of this aspect (and likewise the issue of how to
> deal with categorical data in data sets), in such tutorials, in my
> humble opinion presents a somewhat uneasy hurdle to getting
> started with the scikit-learn tool set. I for one had to use R
> just to overcome these issues when I first started with this, even
> though I would have preferred to use Python and its data science
> stack due to my experience with and preference of Python over R.
>
> Best regards
>
> On 9/30/2015 8:22 PM, Andy wrote:
>> Hi Jake.
>> I think the tutorial Kyle and I did based on the previous
>> tutorials was working quite well.
>> I think it would make sense to work of our scipy ones and improve
>> them further.
>> I'd be happy to work on it.
>> We have some more exercises in a branch, and I have also improved
>> versions of some of the notebooks that I have been using for
>> teaching.
>>
>> Andy
>>
>> On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
>>> Hi All,
>>> PyCon 2016 call for proposals
>>> <https://us.pycon.org/2016/speaking/tutorials/> just opened. For
>>> the last several years Olivier and I have been teaching a
>>> two-part scikit-learn tutorial at each PyCon, and I think they
>>> have gone over well.
>>>
>>> As the conference is just a few hour train ride away for me this
>>> year, I'm certainly going to attend again. I'd also love to put
>>> together one or more scikit-learn tutorials again this year – if
>>> you're planning to attend PyCon and would like to work together
>>> on a proposal or two, let me know!
>>> Jake
>>>
>>> Jake VanderPlas
>>> Senior Data Science Fellow
>>> Director of Research in Physical Sciences
>>> University of Washington eScience Institute
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> <mailto:Scikit-learn-general@lists.sourceforge.net>
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general