I agree that data munging is not, strictly speaking, a machine learning
question, i.e. not from the mathematical or computational point of view.
But there is no denying that most of the time spent doing machine
learning actually goes to data munging. So dealing with data surely has
something to do with machine learning, even if not algorithmically
speaking. After all, if people don't know how to at least import
external data sets to work on, and only know how to deal with very
clean built-in data sets, how can they be expected to appreciate the
tool set or use it in real-life situations?

I don't think pointing people to the pandas manual is really enough. It
was not enough for me, even though I already knew pandas. This is due to
the particular way scikit-learn requires data to be presented to its
objects. Last time I checked (I really don't know if there has been any
change since then), one had to do some wrangling with pandas data
frames, however subtle, to get scikit-learn to understand them. It also
took quite an effort to work out how to encode categorical factors and
represent them in a form that scikit-learn understands.
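
A minimal sketch of the kind of example meant here, under stated
assumptions: the CSV content, the column names (age, city, bought) and the
choice of LogisticRegression are all made up for illustration, and
pd.get_dummies is just one common way to one-hot encode a categorical
column, not necessarily what the tutorials would use.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for an external file. With real data this would be a path
# passed to pd.read_csv. All column names and values are hypothetical.
csv_text = io.StringIO(
    "age,city,bought\n"
    "25,Paris,0\n"
    "47,London,1\n"
    "35,Paris,0\n"
    "52,Berlin,1\n"
    "23,London,0\n"
    "61,Berlin,1\n"
)
df = pd.read_csv(csv_text)

# One-hot encode the categorical column so that every feature is numeric.
X = pd.get_dummies(df[["age", "city"]], columns=["city"])
y = df["bought"]

# Hand plain numeric arrays to the estimator.
X_values = X.values.astype(float)
clf = LogisticRegression()
clf.fit(X_values, y.values)

feature_names = list(X.columns)
predictions = clf.predict(X_values)
```

The categorical column "city" becomes one 0/1 column per distinct value,
which is the numeric, two-dimensional layout scikit-learn estimators
expect.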

Of course it is your call what to do, what to include and what to
ignore. I do think, however, that it would be great if at least one
simple and straightforward example of dealing with external data (some
of it categorical) were included in the tutorial. That would surely be
much appreciated by all, especially by those interested in the tutorials
you might be presenting.

Best regards


On 9/30/2015 11:54 PM, Jacob Vanderplas wrote:
> Hi,
> The problem with including data munging in the tutorial is that it's
> not really a machine learning question. Solutions are generally so
> domain-specific that you can't present it in a way that would be
> generally useful to an interdisciplinary audience. This is why most
> (all?) short machine learning tutorials ignore the data cleaning
> aspect and instead focus on the machine learning algorithms & concepts
> – and in my tutorials, I always try to emphasize the fact that I'm
> leaving this part up to the user (and perhaps point them to the pandas
> tutorial, if one is being offered).
>    Jake
>
>  Jake VanderPlas
>  Senior Data Science Fellow
>  Director of Research in Physical Sciences
>  University of Washington eScience Institute
>
> On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com> wrote:
>
>     Hello Jake and Andy,
>
>     If you would not mind some advice, I would suggest including
>     examples (or at least one) where you use data that is not
>     built-in. I remember the first several tutorials (if not all of
>     them) relied completely on built-in data sets and unapologetically
>     ignored the big elephant in the room that people will need to
>     import/read-in their own data and have to deal with it in
>     scikit-learn one way or another, either through pandas or numpy
>     and these will then hand the data over to the appropriate
>     scikit-learn routines.
>
>     Ignoring coverage of this aspect (and likewise the issue of how to
>     deal with categorical data in data sets), in such tutorials, in my
>     humble opinion presents a somewhat uneasy hurdle to getting
>     started with the scikit-learn tool set. I for one had to use R
>     just to overcome these issues when I first started with this, even
>     though I would have preferred to use Python and its data science
>     stack due to my experience with and preference of Python over R.
>
>     Best regards
>
>
>
>     On 9/30/2015 8:22 PM, Andy wrote:
>>     Hi Jake.
>>     I think the tutorial Kyle and I did based on the previous
>>     tutorials was working quite well.
>>     I think it would make sense to work off our scipy ones and improve
>>     them further.
>>     I'd be happy to work on it.
>>     We have some more exercises in a branch, and I have also improved
>>     versions of some of the notebooks that I have been using for
>>     teaching.
>>
>>     Andy
>>
>>
>>     On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
>>>     Hi All,
>>>     PyCon 2016 call for proposals
>>>     <https://us.pycon.org/2016/speaking/tutorials/> just opened. For
>>>     the last several years Olivier and I have been teaching a
>>>     two-part scikit-learn tutorial at each PyCon, and I think they
>>>     have gone over well.
>>>
>>>     As the conference is just a few-hour train ride away for me this
>>>     year, I'm certainly going to attend again. I'd also love to put
>>>     together one or more scikit-learn tutorials again this year – if
>>>     you're planning to attend PyCon and would like to work together
>>>     on a proposal or two, let me know!
>>>        Jake
>>>
>>>      Jake VanderPlas
>>>      Senior Data Science Fellow
>>>      Director of Research in Physical Sciences
>>>      University of Washington eScience Institute
>>>
>>>
>>>     
>>> ------------------------------------------------------------------------------
>>>
>>>
>>>     _______________________________________________
>>>     Scikit-learn-general mailing list
>>>     Scikit-learn-general@lists.sourceforge.net
>>>     <mailto:Scikit-learn-general@lists.sourceforge.net>
>>>     https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
