I believe a “Data cleaning and preprocessing for data science” 
(insert-snappier-title-here) tutorial would be a great addition to PyCon. 
It’s a prerequisite for machine learning, that’s for sure. A machine learning 
tutorial should probably not sweep it under the carpet completely, but treat it 
only briefly, at least until we have a place / set of resources to point 
people to. This is still an under-served area. 

Best,

Chris

-- 
Christine (Chris) Waigl - cwa...@alaska.edu -  +1-907-474-5483 - Skype: 
cwaigl_work
Geophysical Institute, UAF, 903 Koyukuk Drive, Fairbanks, AK 99775-7320, USA







> On Sep 30, 2015, at 4:39 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> 
> I totally agree with Jake. However, I also think that a few general tutorials 
> on “preprocessing” of “clean” datasets (clean in the sense that missing values, 
> duplicates, and outliers have already been dealt with) could be useful to a 
> broader, interdisciplinary audience. For example:
> 
> - encoding class labels, encoding nominal vs ordinal feature variables 
> - feature scaling and explaining when it matters (convex optimization vs 
> tree-based algos etc.)
> - partial_fit & dimensionality reduction for data compression when the data 
> is too large for a typical desktop machine and the estimator doesn’t support 
> partial_fit; also the partial_fit of the dimensionality reduction transformers
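
A minimal sketch of the scaling and partial_fit items above, in plain Python so that it runs without any dependencies. The `RunningStandardizer` class and the toy chunks are hypothetical and only illustrate the idea of learning scaling statistics chunk by chunk; this is not scikit-learn's implementation.

```python
import math


class RunningStandardizer:
    """Hypothetical helper: learns per-feature mean and variance chunk by chunk."""

    def __init__(self, n_features):
        self.n_seen = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations

    def partial_fit(self, rows):
        # Welford's online update, applied one row at a time.
        for row in rows:
            self.n_seen += 1
            for j, value in enumerate(row):
                delta = value - self.mean[j]
                self.mean[j] += delta / self.n_seen
                self.m2[j] += delta * (value - self.mean[j])
        return self

    def transform(self, rows):
        # Population standard deviation; constant features are left unscaled.
        scale = [math.sqrt(m2 / self.n_seen) or 1.0 for m2 in self.m2]
        return [
            [(value - m) / s for value, m, s in zip(row, self.mean, scale)]
            for row in rows
        ]


# Two chunks standing in for pieces of a file too large to load at once.
chunks = [
    [[1.0, 100.0], [2.0, 200.0]],
    [[3.0, 300.0], [4.0, 400.0]],
]

scaler = RunningStandardizer(n_features=2)
for chunk in chunks:
    scaler.partial_fit(chunk)

# After scaling, both features live on the same scale.
scaled = [row for chunk in chunks for row in scaler.transform(chunk)]
```

The two features differ by a factor of 100 before scaling and are identical afterwards. That matters for gradient-based or distance-based methods, while tree-based models are largely indifferent to it.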
> 
> These are actually very important topics, and I noticed that they typically 
> get short shrift in the general ML tutorials, usually because these 
> tutorials work with a single, specific dataset. Unfortunately, I have seen a 
> couple of applications where nominal string variables were encoded as 
> non-binary integers {1, 2, 3, 4, …}, which may work (i.e., the code executes 
> without error) but is not the optimal way to do it, since it imposes an 
> ordering that the categories do not have.
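
To make the nominal vs ordinal point concrete, here is a small sketch in plain Python. The toy rows, the `size_order` mapping and the `encode` helper are made up for illustration; in practice pandas (`get_dummies`) or scikit-learn's encoders would likely do this work.

```python
# Hypothetical toy data: "color" is nominal, "size" is ordinal.
rows = [
    {"color": "red", "size": "M"},
    {"color": "green", "size": "L"},
    {"color": "blue", "size": "S"},
    {"color": "green", "size": "M"},
]

# Ordinal feature: the order S < M < L is real, so an integer mapping keeps it.
size_order = {"S": 0, "M": 1, "L": 2}

# Nominal feature: one binary column per category, so no order is implied.
colors = sorted({row["color"] for row in rows})  # ['blue', 'green', 'red']


def encode(row):
    """Return [is_blue, is_green, is_red, size_rank] for one row."""
    one_hot = [1 if row["color"] == c else 0 for c in colors]
    return one_hot + [size_order[row["size"]]]


X = [encode(row) for row in rows]
# X == [[0, 0, 1, 1], [0, 1, 0, 2], [1, 0, 0, 0], [0, 1, 0, 1]]
```

Had the colors been coded as blue=1, green=2, red=3 instead, a linear model would treat red as "more" than blue, which is an artifact of the encoding rather than a property of the data.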
> 
> Best,
> Sebastian
> 
>> On Sep 30, 2015, at 7:54 PM, Jacob Vanderplas <jake...@cs.washington.edu> 
>> wrote:
>> 
>> Hi,
>> The problem with including data munging in the tutorial is that it's not 
>> really a machine learning question. Solutions are generally so 
>> domain-specific that you can't present them in a way that would be generally 
>> useful to an interdisciplinary audience. This is why most (all?) short 
>> machine learning tutorials ignore the data cleaning aspect and instead focus 
>> on the machine learning algorithms & concepts – and in my tutorials, I 
>> always try to emphasize the fact that I'm leaving this part up to the user 
>> (and perhaps point them to the pandas tutorial, if one is being offered).
>>   Jake
>> 
>> Jake VanderPlas
>> Senior Data Science Fellow
>> Director of Research in Physical Sciences
>> University of Washington eScience Institute
>> 
>> On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com> wrote:
>> Hello Jake and Andy,
>> 
>> If you would not mind some advice, I would suggest including examples (or at 
>> least one) that use data that is not built in. I remember that the first 
>> several tutorials (if not all of them) relied completely on built-in data 
>> sets and unapologetically ignored the elephant in the room: people will 
>> need to import their own data and deal with it in scikit-learn one way or 
>> another, through either pandas or numpy, which then hand the data over to 
>> the appropriate scikit-learn routines. 
>> 
>> Leaving this aspect out of such tutorials (and likewise the issue of how to 
>> deal with categorical data in data sets) presents, in my humble opinion, an 
>> awkward hurdle to getting started with the scikit-learn tool set. I for one 
>> had to use R just to overcome these issues when I first started, even though 
>> I would have preferred to use Python and its data science stack, given my 
>> experience with Python and my preference for it over R.
>> 
>> Best regards
>> 
>> 
>> 
>> On 9/30/2015 8:22 PM, Andy wrote:
>>> Hi Jake.
>>> I think the tutorial Kyle and I did based on the previous tutorials was 
>>> working quite well.
>>> I think it would make sense to work off our SciPy ones and improve them 
>>> further.
>>> I'd be happy to work on it.
>>> We have some more exercises in a branch, and I also have improved versions 
>>> of some of the notebooks that I have been using for teaching.
>>> 
>>> Andy
>>> 
>>> 
>>> On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
>>>> Hi All,
>>>> PyCon 2016 call for proposals just opened. For the last several years 
>>>> Olivier and I have been teaching a two-part scikit-learn tutorial at each 
>>>> PyCon, and I think they have gone over well.
>>>> 
>>>> As the conference is just a few hours' train ride away for me this year, I'm 
>>>> certainly going to attend again. I'd also love to put together one or more 
>>>> scikit-learn tutorials again this year – if you're planning to attend 
>>>> PyCon and would like to work together on a proposal or two, let me know!
>>>>   Jake
>>>> 
>>>> Jake VanderPlas
>>>> Senior Data Science Fellow
>>>> Director of Research in Physical Sciences
>>>> University of Washington eScience Institute
>>>> 
>>>> 
>>>> ------------------------------------------------------------------------------
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> 
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

