I totally agree with Jake. However, I also think that a few general tutorials
on “preprocessing” of “clean” datasets (clean in the sense that missing values,
duplicates, and outliers have already been dealt with) could be useful to a
broader, interdisciplinary audience. For example:
- encoding class labels, encoding nominal vs ordinal feature variables
- feature scaling and explaining when it matters (convex optimization vs.
tree-based algorithms, etc.)
- partial_fit & dimensionality reduction for data compression if the data is
too large for a typical desktop machine and the estimator doesn’t support
partial_fit; also covering partial_fit of the dimensionality reduction
transformers
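To make the last item a bit more concrete, here is a minimal sketch of the idea. The random chunks, the hypothetical load_chunk helper, and the choice of IncrementalPCA and LogisticRegression are just assumptions for illustration: the transformer is fitted chunk by chunk via partial_fit, and the compressed data is then small enough to hand to an estimator that has no partial_fit.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import LogisticRegression

n_chunks, chunk_size, n_features = 5, 200, 50


def load_chunk(i):
    """Hypothetical loader: stands in for reading chunk i from disk."""
    chunk_rng = np.random.RandomState(i)
    X = chunk_rng.normal(size=(chunk_size, n_features))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


# Pass 1: learn the projection one chunk at a time.
ipca = IncrementalPCA(n_components=10)
for i in range(n_chunks):
    X_chunk, _ = load_chunk(i)
    ipca.partial_fit(X_chunk)

# Pass 2: compress each chunk; only the reduced data is kept in memory.
X_small, y_all = [], []
for i in range(n_chunks):
    X_chunk, y_chunk = load_chunk(i)
    X_small.append(ipca.transform(X_chunk))
    y_all.append(y_chunk)
X_small = np.vstack(X_small)
y_all = np.concatenate(y_all)

# The compressed data can now go to an estimator without partial_fit.
clf = LogisticRegression().fit(X_small, y_all)
print(X_small.shape)
```

With real data the number of components and the chunk size would of course have to be chosen for the problem at hand.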
These are actually very important topics, and I noticed that they typically
get too little coverage in general ML tutorials, mostly because these
tutorials work with a single, specific dataset. Unfortunately, I have seen a
couple of applications where nominal string variables were encoded as
non-binary integers {1, 2, 3, 4, …}, which may work (i.e., the code executes
without error) but is not the optimal way to do it, since it implies an
ordering among the categories that does not exist.
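As a small sketch of the distinction (the toy DataFrame, its column names, and the size ordering are made up for illustration): an ordinal feature gets an explicit integer mapping that preserves its order, a nominal feature gets one binary column per category, and class labels can simply be enumerated.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, invented for this example.
df = pd.DataFrame({
    "color": ["green", "red", "blue"],        # nominal: no inherent order
    "size": ["M", "L", "XL"],                 # ordinal: M < L < XL
    "label": ["class1", "class2", "class1"],  # class labels
})

# Ordinal feature: an explicit mapping keeps the order meaningful.
size_order = {"M": 1, "L": 2, "XL": 3}
df["size"] = df["size"].map(size_order)

# Nominal feature: one binary column per category, so no order is implied.
color_dummies = pd.get_dummies(df["color"], prefix="color").astype(int)

# Class labels: arbitrary integers are fine, since classifiers treat them
# as categories rather than magnitudes.
y = LabelEncoder().fit_transform(df["label"])

X = pd.concat([df[["size"]], color_dummies], axis=1)
print(X)
print(y)
```

Encoding "color" as {1, 2, 3} instead would run just as well, but a linear model would then treat one color as "larger" than another.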
Best,
Sebastian
> On Sep 30, 2015, at 7:54 PM, Jacob Vanderplas wrote:
>
> Hi,
> The problem with including data munging in the tutorial is that it's not
> really a machine learning question. Solutions are generally so
> domain-specific that you can't present it in a way that would be generally
> useful to an interdisciplinary audience. This is why most (all?) short
> machine learning tutorials ignore the data cleaning aspect and instead focus
> on the machine learning algorithms & concepts – and in my tutorials, I always
> try to emphasize the fact that I'm leaving this part up to the user (and
> perhaps point them to the pandas tutorial, if one is being offered).
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Research in Physical Sciences
> University of Washington eScience Institute
>
> On Wed, Sep 30, 2015 at 4:41 PM, KAB wrote:
> Hello Jake and Andy,
>
> If you would not mind some advice, I would suggest including examples (or at
> least one) where you use data that is not built-in. I remember the first
> several tutorials (if not all of them) relied completely on built-in data
> sets and unapologetically ignored the elephant in the room: people will need
> to import/read in their own data and deal with it in scikit-learn one way or
> another, through either pandas or numpy, which then hand the data over to
> the appropriate scikit-learn routines.
>
> Ignoring this aspect (and likewise the issue of how to deal with categorical
> data in data sets) in such tutorials presents, in my humble opinion, an
> awkward hurdle to getting started with the scikit-learn tool set. I for one
> had to use R just to overcome these issues when I first started with this,
> even though I would have preferred to use Python and its data science stack
> due to my experience with and preference for Python over R.
>
> Best regards
>
>
>
> On 9/30/2015 8:22 PM, Andy wrote:
>> Hi Jake.
>> I think the tutorial Kyle and I did based on the previous tutorials was
>> working quite well.
>> I think it would make sense to work off our SciPy ones and improve them
>> further.
>> I'd be happy to work on it.
>> We have some more exercises in a branch, and I have also improved versions
>> of some of the notebooks that I have been using for teaching.
>>
>> Andy
>>
>>
>> On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
>>> Hi All,
>>> PyCon 2016 call for proposals just opened. For the last several years
>>> Olivier and I have been teaching a two-part scikit-learn tutorial at each
>>> PyCon, and I think they have gone over well.
>>>
>>> As the conference is just a few hours' train ride away for me this year, I'm
>>> certainly going to attend again. I'd also love to put together one or more
>>> scikit-learn tutorials again this year – if you're planning to attend PyCon
>>> and would like to work together on a proposal or two, let me know!
>>> Jake
>>>
>>> Jake VanderPlas
>>> Senior Data Science Fellow
>>> Director of Research in Physical Sciences
>>> University of Washington eScience Institute
>>>
>>>
>>> ------------------------------------------------------------------------------
>>>
>>>
>>>
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>>
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general