Hi all,
Interesting discussion – thanks for the thoughts and ideas! I like the idea
of a Data Munging tutorial being separate from an ML tutorial. With only 3
hours, that seems more doable than trying to squeeze it all into one
session. Perhaps Stephan's idea of an image-focused ML tutorial could be a
kind of case study in domain-specific data munging.

I'd echo Kyle's comments on time being a huge constraint. I've given some
version of my basic scikit-learn tutorial probably two dozen times now, and
I find that 3 hours is basically enough to cover data format (i.e.
n_samples x n_features), the estimator API, the basics of supervised vs
unsupervised learning, and a brief deeper dive into two or three estimators
(I usually cover SVMs, Random Forests, and K-Means). That might not seem
like a lot of material, but for an intro audience who asks a lot of
questions, it can easily fill the time.
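
For anyone who hasn't seen that material, here is a minimal sketch of what
the data format and estimator API pieces look like in code. It is
illustrative only and not taken from the tutorial notebooks: it assumes a
recent scikit-learn (where train_test_split lives in
sklearn.model_selection), uses the built-in iris data, and the
hyperparameters are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Data format: X is a 2-D array of shape (n_samples, n_features);
# y is a 1-D array holding one label per sample.
iris = load_iris()
X, y = iris.data, iris.target
n_samples, n_features = X.shape

# Hold out a quarter of the samples for evaluation (arbitrary split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Supervised estimators share one interface: fit(X, y), then
# predict(X) or score(X, y).
scores = {}
for model in (SVC(), RandomForestClassifier(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

# Unsupervised estimators are fit on X alone; no labels are passed.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)
```

The point is that swapping one estimator for another changes a single
line; the fit/predict calls stay the same.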

This weekend Stephan, Andy, and I will be at a meeting together near
Seattle. Perhaps the three of us can chat a bit in person and see how we
might tap all of this enthusiasm & incorporate the ideas presented here. I
think they will go over well: having been on the PyCon tutorial committee
for the past several years, I know that there has been demand for increased
data-science-related tutorial topics. I can also tell you that if we
present a "united front", so to speak, with some thought put into the flow
between multiple related tutorials, we'll have a better chance of getting
them accepted (though keep in mind that if submissions are similar to the
last few years, only about 1 in 4 proposals will be accepted overall!)

Thanks,
   Jake

 Jake VanderPlas
 Senior Data Science Fellow
 Director of Research in Physical Sciences
 University of Washington eScience Institute

On Wed, Sep 30, 2015 at 7:06 PM, Kyle Kastner <kastnerk...@gmail.com> wrote:

> If people are planning to work on this, it would be good to check what
> Andy and I presented at SciPy, which is based on what Jake and Olivier
> did at PyCon (and what Andy, Jake and Gael did at SciPy 2013, etc.
> etc.).
>
> To Sebastian's points - we covered all of these nearly verbatim except
> perhaps class imbalance (maybe in the spam example? Don't recall
> explicitly covering this, though it was requested in the course
> feedback). We also covered "out of sklearn" data fairly extensively,
> loading from CSV and preprocessing several datasets. See specifically
> the case studies here:
>
> https://github.com/amueller/scipy_2015_sklearn_tutorial/tree/master/notebooks
>
> An issue is that really big dataset downloads tend to melt conference
> wifi, and even sending many reminder emails to clone and download all
> the data will only slightly reduce the number of all-at-once
> downloads, so there is a balance between "interesting" and "large" at
> play. Personally, I think it would be nice to show an image example
> using skimage, and maybe something more esoteric, but that might be an
> issue since the tutorial is already quite crowded.
>
> One of the key things I noticed is that *time* is a huge issue - even at
> SciPy, which has a slightly more technical base level than PyCon, we
> ran out of time to cover these topics to the appropriate depth (with two
> 4-hour sessions!). Covering things sufficiently well for introductory
> students while also providing enough tips for people doing this in
> practice is hard, and the breadth of experience at PyCon will probably
> be even more difficult to cover than SciPy. Something to be aware of,
> at least.
>
> Andy has also been working on some presentations/courses for the book
> which might be useful here, though I don't know what state they are in
> at the current moment.
>
> As Andy said, we have some solutions in a branch and I would be glad
> to help get this set up. I don't know whether I will be there or not
> just yet, but I hope to attend and could maybe teach if more hands
> are needed and my schedule works out.
>
> On Wed, Sep 30, 2015 at 8:58 PM, Chris Waigl <cwa...@alaska.edu> wrote:
> > I believe a “Data cleaning and preprocessing for data science”
> > (insert-snappier-title-here) tutorial would be a great addition to
> > PyCon. It’s a prerequisite for machine learning, that’s for sure. A
> > machine learning tutorial should probably not completely sweep it
> > under the carpet, but treat it in the briefest way, at least until we
> > have a place / set of resources to point people to. This is still an
> > under-served area.
> >
> > Best,
> >
> > Chris
> >
> > --
> > Christine (Chris) Waigl - cwa...@alaska.edu -  +1-907-474-5483 - Skype:
> > cwaigl_work
> > Geophysical Institute, UAF, 903 Koyukuk Drive, Fairbanks, AK 99775-7320, USA
> >
> > On Sep 30, 2015, at 4:39 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> >
> > I totally agree with Jake. However, I also think that a few general
> > tutorials on “preprocessing” of “clean” datasets (clean in the sense
> > that missing values, duplicates, and outliers have been dealt with)
> > could be useful to a broader, interdisciplinary audience. For example:
> >
> > - encoding class labels, encoding nominal vs ordinal feature variables
> > - feature scaling and explaining when it matters (convex optimization vs
> >   tree-based algos etc.)
> > - partial_fit & dimensionality reduction for data compression if data is
> >   too large for a typical desktop machine and the estimators don’t
> >   support partial_fit; also talking about partial_fit of the dim
> >   reduction transformers
> >
> > These are actually very important topics, and I noticed that they
> > typically fall a little bit short in the general ML tutorials, usually
> > because these tutorials work with a single, specific dataset.
> > Unfortunately, I have seen a couple of applications where nominal string
> > variables were encoded as non-binary integers {1, 2, 3, 4, …}, which may
> > work (i.e., the code executes without error) but is not the optimal way
> > to do it.
> >
> > Best,
> > Sebastian
> >
> > On Sep 30, 2015, at 7:54 PM, Jacob Vanderplas <jake...@cs.washington.edu> wrote:
> >
> > Hi,
> > The problem with including data munging in the tutorial is that it's not
> > really a machine learning question. Solutions are generally so
> > domain-specific that you can't present them in a way that would be
> > generally useful to an interdisciplinary audience. This is why most
> > (all?) short machine learning tutorials ignore the data cleaning aspect
> > and instead focus on the machine learning algorithms & concepts – and in
> > my tutorials, I always try to emphasize the fact that I'm leaving this
> > part up to the user (and perhaps point them to the pandas tutorial, if
> > one is being offered).
> >   Jake
> >
> > Jake VanderPlas
> > Senior Data Science Fellow
> > Director of Research in Physical Sciences
> > University of Washington eScience Institute
> >
> > On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com> wrote:
> > Hello Jake and Andy,
> >
> > If you would not mind some advice, I would suggest including examples (or
> > at least one) where you use data that is not built-in. I remember the
> > first several tutorials (if not all of them) relied completely on
> > built-in data sets and unapologetically ignored the big elephant in the
> > room: people will need to import/read in their own data and deal with it
> > in scikit-learn one way or another, either through pandas or numpy, which
> > will then hand the data over to the appropriate scikit-learn routines.
> >
> > Ignoring this aspect (and likewise the issue of how to deal with
> > categorical data in data sets) in such tutorials, in my humble opinion,
> > presents a somewhat awkward hurdle to getting started with the
> > scikit-learn tool set. I for one had to use R just to overcome these
> > issues when I first started with this, even though I would have preferred
> > to use Python and its data science stack due to my experience with and
> > preference for Python over R.
> >
> > Best regards
> >
> >
> >
> > On 9/30/2015 8:22 PM, Andy wrote:
> >
> > Hi Jake.
> > I think the tutorial Kyle and I did based on the previous tutorials was
> > working quite well.
> > I think it would make sense to work off our SciPy ones and improve them
> > further.
> > I'd be happy to work on it.
> > We have some more exercises in a branch, and I also have improved
> > versions of some of the notebooks that I have been using for teaching.
> >
> > Andy
> >
> >
> > On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
> >
> > Hi All,
> > PyCon 2016 call for proposals just opened. For the last several years
> > Olivier and I have been teaching a two-part scikit-learn tutorial at each
> > PyCon, and I think they have gone over well.
> >
> > As the conference is just a few hours' train ride away for me this year,
> > I'm certainly going to attend again. I'd also love to put together one or
> > more scikit-learn tutorials again this year – if you're planning to attend
> > PyCon and would like to work together on a proposal or two, let me know!
> >   Jake
> >
> > Jake VanderPlas
> > Senior Data Science Fellow
> > Director of Research in Physical Sciences
> > University of Washington eScience Institute
> >
> >
> >
> ------------------------------------------------------------------------------
> >
> >
> >
> > _______________________________________________
> > Scikit-learn-general mailing list
> >
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general