If people are planning to work on this, it would be good to look at what
Andy and I presented at SciPy, which builds on what Jake and Olivier
did at PyCon (and on what Andy, Jake and Gael did at SciPy 2013, and so
on).

To Sebastian's points: we covered all of these nearly verbatim, except
perhaps class imbalance (maybe in the spam example? I don't recall
explicitly covering it, though it was requested in the course
feedback). We also covered "out of sklearn" data fairly extensively,
loading from CSV and preprocessing several datasets. See specifically
the case studies here:
https://github.com/amueller/scipy_2015_sklearn_tutorial/tree/master/notebooks
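
For anyone who has not looked at the notebooks, the "out of sklearn"
pattern is roughly the following. This is only a minimal sketch, not
taken from the tutorial itself: the CSV content, the column names and
the median fill are all made up for illustration.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical CSV content; a real file path could be passed to
# pd.read_csv instead of the StringIO wrapper.
CSV_TEXT = """age,income,label
25,40000,no
32,,yes
47,81000,yes
51,62000,no
38,58000,yes
29,43000,no
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Separate the features from the target column.
X = df.drop(columns="label")
y = (df["label"] == "yes").astype(int)

# Fill the missing income with the column median (one simple choice).
X = X.fillna(X.median())

# Standardize, then hand plain arrays to the estimator.
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_scaled, y)
predictions = clf.predict(X_scaled)
```

The point is just that pandas does the reading and cleaning, and
scikit-learn only ever sees a numeric array and a label vector.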

One issue is that really big dataset downloads tend to melt conference
wifi, and even many reminder emails asking people to clone the repo and
download all the data ahead of time only slightly reduce the number of
simultaneous downloads, so there is a balance between "interesting" and
"large" at play. Personally, I think it would be nice to show an image
example using skimage, and maybe something more esoteric, but that might
be an issue since the tutorial is already quite crowded.

One of the key things I noticed is that *time* is a huge issue. Even at
SciPy, which has a slightly more technical base level than PyCon, we
ran out of time to cover these topics in appropriate depth (with two
4-hour sessions!). Covering things sufficiently well for introductory
students while also providing enough tips for people doing this in
practice is hard, and the breadth of experience at PyCon will probably
be even more difficult to cover than at SciPy. Something to be aware of,
at least.

Andy has also been working on some presentations/courses for the book
which might be useful here, though I don't know what state they are in
at the current moment.

As Andy said, we have some solutions in a branch and I would be glad
to help get this set up. I don't know yet whether I will be there, but
I hope to attend and could maybe teach if more hands are needed and my
schedule works out.

On Wed, Sep 30, 2015 at 8:58 PM, Chris Waigl <cwa...@alaska.edu> wrote:
> I believe a “Data cleaning and preprocessing for data science”
> (insert-snappier-title-here) tutorial would be a great addition to a PyCon.
> It’s a prerequisite for machine learning, that’s sure. A machine learning
> tutorial should probably not completely sweep it under the carpet, but treat
> it in the briefest way, at least until we have a place / set of resources to
> point people to. This is still an under-served area.
>
> Best,
>
> Chris
>
> --
> Christine (Chris) Waigl - cwa...@alaska.edu -  +1-907-474-5483 - Skype:
> cwaigl_work
> Geophysical Institute, UAF, 903 Koyukuk Drive, Fairbanks, AK 99775-7320, USA
>
> On Sep 30, 2015, at 4:39 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
> I totally agree with Jake. However, I also think that a few general
> tutorials on “preprocessing” of “clean” datasets (clean in terms of missing
> values, duplicates, outliers have been dealt with) could be useful to a
> broader, interdisciplinary audience. For example:
>
> - encoding class labels, encoding nominal vs ordinal feature variables
> - feature scaling and explaining when it matters (convex optimization vs
> tree-based algos etc.)
> - partial_fit & dimensionality reduction for data compression if data is too
> large for a typical desktop machine and estimators that don’t support
> partial_fit; also talking about partial_fit of the dim reduction
> transformers
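
A rough sketch of the partial_fit point above, assuming synthetic
chunks stand in for pieces of a dataset read from disk. The helper
read_batch and all of the sizes are hypothetical; IncrementalPCA and
SGDClassifier are just one possible pairing of a transformer and an
estimator that both support partial_fit.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import SGDClassifier

N_BATCHES, BATCH_SIZE, N_FEATURES = 10, 200, 50


def read_batch(i):
    """Return synthetic chunk i; a stand-in for reading one piece from disk."""
    rng = np.random.RandomState(i)
    X = rng.randn(BATCH_SIZE, N_FEATURES)
    X[:, :2] *= 5.0  # give the two informative features most of the variance
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


# Pass 1: fit the dimensionality reduction one chunk at a time.
ipca = IncrementalPCA(n_components=10)
for i in range(N_BATCHES):
    X, _ = read_batch(i)
    ipca.partial_fit(X)

# Pass 2: train an incremental classifier on the compressed chunks.
clf = SGDClassifier(random_state=0)
for i in range(N_BATCHES):
    X, y = read_batch(i)
    clf.partial_fit(ipca.transform(X), y, classes=np.array([0, 1]))
```

Two passes over the data are needed here because the transformer has
to be fitted before the classifier can see the reduced features.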
>
> These are actually very important topics, and I noticed that they typically
> fall a little bit short in the general ML tutorials; typically, because
> these tutorials work with a single, specific dataset. Unfortunately, I have
> seen a couple of applications where nominal string variables were encoded as
> non-binary integers {1, 2, 3, 4, …}, which may work (i.e., the code executes
> without error) but is not the optimal way to do it.
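
To make the nominal vs ordinal distinction concrete, here is a small
sketch using a made-up DataFrame. The column names, the categories and
the size ordering are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],  # nominal: no order
    "size": ["M", "L", "S", "XL"],               # ordinal: S < M < L < XL
    "label": ["spam", "ham", "spam", "ham"],     # class labels
})

# Ordinal feature: an explicit mapping keeps the intended order.
size_order = {"S": 0, "M": 1, "L": 2, "XL": 3}
df["size_code"] = df["size"].map(size_order)

# Nominal feature: one binary column per category, so no order or
# distance between categories is implied.
color_dummies = pd.get_dummies(df["color"], prefix="color")

# Class labels: integer codes are fine here, since classifiers treat
# them as category identifiers rather than magnitudes.
y = LabelEncoder().fit_transform(df["label"])

features = pd.concat([df[["size_code"]], color_dummies], axis=1)
```

Coding color as {1, 2, 3} instead would run without error, but a linear
model would then treat one color as "greater than" another.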
>
> Best,
> Sebastian
>
> On Sep 30, 2015, at 7:54 PM, Jacob Vanderplas <jake...@cs.washington.edu>
> wrote:
>
> Hi,
> The problem with including data munging in the tutorial is that it's not
> really a machine learning question. Solutions are generally so
> domain-specific that you can't present it in a way that would be generally
> useful to an interdisciplinary audience. This is why most (all?) short
> machine learning tutorials ignore the data cleaning aspect and instead focus
> on the machine learning algorithms & concepts – and in my tutorials, I
> always try to emphasize the fact that I'm leaving this part up to the user
> (and perhaps point them to the pandas tutorial, if one is being offered).
>   Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Research in Physical Sciences
> University of Washington eScience Institute
>
> On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com> wrote:
> Hello Jake and Andy,
>
> If you would not mind some advice, I would suggest including examples (or at
> least one) where you use data that is not built-in. I remember the first
> several tutorials (if not all of them) relied completely on built-in data
> sets and unapologetically ignored the big elephant in the room that people
> will need to import/read-in their own data and have to deal with it in
> scikit-learn one way or another, either through pandas or numpy and these
> will then hand the data over to the appropriate scikit-learn routines.
>
> Ignoring coverage of this aspect (and likewise the issue of how to deal with
> categorical data in data sets), in such tutorials, in my humble opinion
> presents a somewhat uneasy hurdle to getting started with the scikit-learn
> tool set. I for one had to use R just to overcome these issues when I first
> started with this, even though I would have preferred to use Python and its
> data science stack due to my experience with and preference of Python over
> R.
>
> Best regards
>
>
>
> On 9/30/2015 8:22 PM, Andy wrote:
>
> Hi Jake.
> I think the tutorial Kyle and I did based on the previous tutorials was
> working quite well.
> I think it would make sense to work off our SciPy ones and improve them
> further.
> I'd be happy to work on it.
> We have some more exercises in a branch, and I have also improved versions
> of some of the notebooks that I have been using for teaching.
>
> Andy
>
>
> On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
>
> Hi All,
> PyCon 2016 call for proposals just opened. For the last several years
> Olivier and I have been teaching a two-part scikit-learn tutorial at each
> PyCon, and I think they have gone over well.
>
> As the conference is just a few hours' train ride away for me this year, I'm
> certainly going to attend again. I'd also love to put together one or more
> scikit-learn tutorials again this year – if you're planning to attend PyCon
> and would like to work together on a proposal or two, let me know!
>   Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Research in Physical Sciences
> University of Washington eScience Institute
>
>
> ------------------------------------------------------------------------------
>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
>
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

