I totally agree with Jake. However, I also think that a few general tutorials 
on “preprocessing” of “clean” datasets (clean in the sense that missing values, 
duplicates, and outliers have already been dealt with) could be useful to a 
broader, interdisciplinary audience. For example:

- encoding class labels, and encoding nominal vs. ordinal feature variables 
- feature scaling, and explaining when it matters (convex optimization vs. 
tree-based algorithms, etc.)
- partial_fit and dimensionality reduction for data compression when the data 
is too large for a typical desktop machine and the estimator doesn’t support 
partial_fit; also covering partial_fit of the dimensionality reduction 
transformers themselves
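
To make the last point concrete, here is a rough sketch of what I have in mind, 
not a definitive recipe. The data, the chunk size, and the iter_chunks helper 
are made up for illustration; in practice the chunks would be read from disk. 
Scaling comes first because PCA is variance-based and would otherwise be 
dominated by the one large-scale feature.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for a dataset that would normally live on disk.
rng = np.random.RandomState(0)
latent = rng.normal(size=(2000, 3))
X_all = latent @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(2000, 50))
X_all[:, 0] *= 1000.0  # one feature on a much larger scale than the rest
y_all = (latent[:, 0] > 0).astype(int)


def iter_chunks(chunk_size=200):
    """Yield (X, y) chunks; in practice these would be read from disk."""
    for start in range(0, X_all.shape[0], chunk_size):
        stop = start + chunk_size
        yield X_all[start:stop], y_all[start:stop]


# Pass 1: learn the per-feature mean and variance incrementally.
scaler = StandardScaler()
for X_chunk, _ in iter_chunks():
    scaler.partial_fit(X_chunk)

# Pass 2: learn the projection incrementally on the scaled chunks.
ipca = IncrementalPCA(n_components=3)
for X_chunk, _ in iter_chunks():
    ipca.partial_fit(scaler.transform(X_chunk))

# Pass 3: compress chunk by chunk. The reduced matrix is small enough to hold
# in memory, so an estimator without partial_fit (here SVC) can be fit on it.
X_small = np.vstack(
    [ipca.transform(scaler.transform(X_chunk)) for X_chunk, _ in iter_chunks()]
)
clf = SVC().fit(X_small, y_all)
```

If the estimator itself supports partial_fit (SGDClassifier, for example), the 
third pass could feed it the reduced chunks directly instead of stacking them.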

These are actually very important topics, and I have noticed that they 
typically get too little attention in general ML tutorials, mostly because 
these tutorials work with a single, specific dataset. Unfortunately, I have 
seen a couple of applications where nominal string variables were encoded as 
non-binary integers {1, 2, 3, 4, …}, which may work (i.e., the code executes 
without error) but is not the optimal way to do it, since the integers imply 
an order that the categories don’t have.
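
As a small illustration of the distinction (a sketch only: the toy columns and 
the size_order mapping are invented, and passing strings straight to 
OneHotEncoder assumes a reasonably recent scikit-learn release):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data: 'color' is nominal, 'size' is ordinal.
colors = np.array([["green"], ["red"], ["blue"], ["red"]])
sizes = ["M", "L", "XL", "M"]
labels = ["class1", "class2", "class1", "class2"]

# Class labels: arbitrary integers are fine here, because classifiers treat
# the target values as unordered categories.
y = LabelEncoder().fit_transform(labels)

# Ordinal feature: an explicit mapping preserves the known order M < L < XL.
size_order = {"M": 1, "L": 2, "XL": 3}
size_encoded = np.array([size_order[s] for s in sizes])

# Nominal feature: one binary column per category, so no order is implied.
# The columns follow encoder.categories_[0] (sorted: blue, green, red).
encoder = OneHotEncoder()
color_onehot = encoder.fit_transform(colors).toarray()
```

Encoding the colors as {1, 2, 3} instead would run without error, but a linear 
model would then treat one color as "larger" than another.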

Best,
Sebastian

> On Sep 30, 2015, at 7:54 PM, Jacob Vanderplas <jake...@cs.washington.edu> 
> wrote:
> 
> Hi,
> The problem with including data munging in the tutorial is that it's not 
> really a machine learning question. Solutions are generally so 
> domain-specific that you can't present it in a way that would be generally 
> useful to an interdisciplinary audience. This is why most (all?) short 
> machine learning tutorials ignore the data cleaning aspect and instead focus 
> on the machine learning algorithms & concepts – and in my tutorials, I always 
> try to emphasize the fact that I'm leaving this part up to the user (and 
> perhaps point them to the pandas tutorial, if one is being offered).
>    Jake
> 
>  Jake VanderPlas
>  Senior Data Science Fellow
>  Director of Research in Physical Sciences
>  University of Washington eScience Institute
> 
> On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com> wrote:
> Hello Jake and Andy,
> 
> If you would not mind some advice, I would suggest including examples (or at 
> least one) that use data that is not built in. I remember that the first 
> several tutorials (if not all of them) relied completely on built-in data 
> sets and unapologetically ignored the elephant in the room: people will need 
> to import their own data and deal with it in scikit-learn one way or 
> another, via pandas or NumPy, which then hand the data over to the 
> appropriate scikit-learn routines. 
> 
> Ignoring this aspect (and likewise the issue of how to deal with categorical 
> data) in such tutorials, in my humble opinion, presents an awkward hurdle to 
> getting started with the scikit-learn tool set. I, for one, had to use R just 
> to overcome these issues when I first started, even though I would have 
> preferred to use Python and its data science stack, given my experience with 
> and preference for Python over R.
> 
> Best regards
> 
> 
> 
> On 9/30/2015 8:22 PM, Andy wrote:
>> Hi Jake.
>> I think the tutorial Kyle and I did based on the previous tutorials was 
>> working quite well.
>> I think it would make sense to work off our SciPy ones and improve them 
>> further.
>> I'd be happy to work on it.
>> We have some more exercises in a branch, and I have also improved versions 
>> of some of the notebooks that I have been using for teaching.
>> 
>> Andy
>> 
>> 
>> On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
>>> Hi All,
>>> PyCon 2016 call for proposals just opened. For the last several years 
>>> Olivier and I have been teaching a two-part scikit-learn tutorial at each 
>>> PyCon, and I think they have gone over well.
>>> 
>>> As the conference is just a few hours' train ride away for me this year, I'm 
>>> certainly going to attend again. I'd also love to put together one or more 
>>> scikit-learn tutorials again this year – if you're planning to attend PyCon 
>>> and would like to work together on a proposal or two, let me know!
>>>    Jake
>>> 
>>>  Jake VanderPlas
>>>  Senior Data Science Fellow
>>>  Director of Research in Physical Sciences
>>>  University of Washington eScience Institute
>>> 
>>> 
>>> ------------------------------------------------------------------------------
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> 
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

