Ah, I should say that for dealing with "too large" data, people should be referred to Andreas' tutorial at PyData 2015 (I think that's where I saw it) and the scikit-learn website. I don't see any reason to repeat it at PyData or PyCon. If necessary, I would make it a separate tutorial or talk, or mention it in passing as a reference to previous presentations.
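(For readers landing on this thread without that context: the core pattern those references teach is out-of-core learning via partial_fit. A minimal sketch, with a synthetic mini-batch generator standing in for chunks streamed from disk:)

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def iter_minibatches(n_batches=10, batch_size=100):
        # Synthetic stand-in for chunks read from disk,
        # e.g. via pandas.read_csv(..., chunksize=batch_size)
        rng = np.random.RandomState(0)
        for _ in range(n_batches):
            X = rng.randn(batch_size, 20)
            y = rng.randint(0, 2, size=batch_size)
            yield X, y

    clf = SGDClassifier(random_state=0)
    all_classes = np.array([0, 1])  # every label must be declared up front

    for X_chunk, y_chunk in iter_minibatches():
        # Each call updates the model on one mini-batch, so the full
        # dataset never has to fit in memory at once.
        clf.partial_fit(X_chunk, y_chunk, classes=all_classes)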
I'd like to see a tutorial on uses of Recursive Feature Elimination and Bayesian optimization for hyperparameter search, including a comparison to grid search and randomized search. A panel discussion worth having is whether there are any emerging best practices in machine learning, statistics, etc. Since conferences have parallel tracks, it's possible that participants can't get to all the talks they're interested in; having these talks linked off the scikit-learn web page would be quite valuable, I think.

Dale Smith, Ph.D.
Data Scientist
d. 404.495.7220 x 4008
f. 404.795.7221
Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA 30305

-----Original Message-----
From: Dale Smith [mailto:dsm...@nexidia.com]
Sent: Monday, October 05, 2015 8:35 AM
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

Re "The Ins and Outs of a Machine Learning Pipeline — About the data that you feed to a learning algorithm and how to analyze the results". I'm referencing the proposed list of topics below. I think it's a valuable tutorial. I would break these apart, however, and introduce scaling:

- when it's necessary
- when it's not necessary

Both are algorithm-dependent. Additionally, there are other issues in scaling, such as applying the same scaling to the training, test, and evaluation data sets. Should these data sets be scaled individually, or should the entire data set be scaled before splitting? I've seen a recommendation that the scaling fitted on the training data should be applied to the test and evaluation data sets, which may result in values that are not strictly within [-1, 1]. Should the same scaling transformation be applied to data in the production system? Why or why not? I haven't seen any of these issues addressed at all, but they are important parts of properly applying machine learning.

Dale Smith, Ph.D.
Data Scientist
d. 404.495.7220 x 4008
f. 404.795.7221
Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA 30305
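(On the scaling question above: the usual scikit-learn convention is to fit the scaler on the training split only and reuse the fitted transformation everywhere else, including production. A minimal sketch, using the 0.17-era module paths; newer releases moved train_test_split into sklearn.model_selection:)

    from sklearn.cross_validation import train_test_split
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=0)

    scaler = StandardScaler()
    # Fit on the training split only, so no information about the
    # test/evaluation data leaks into the preprocessing step.
    X_train_std = scaler.fit_transform(X_train)

    # Reuse the *same* fitted parameters for test, evaluation, and
    # production data; transformed values may fall outside the range
    # seen during training, which is expected.
    X_test_std = scaler.transform(X_test)

Scaling the entire dataset before splitting leaks test-set statistics into training, which is why the fit-on-train convention is generally recommended.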
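(And on the feature-elimination and hyperparameter-search comparison mentioned at the top of the thread: RFE, grid search, and randomized search are all built into scikit-learn; Bayesian optimization is not, though third-party libraries such as hyperopt cover it. A rough sketch of the built-in pieces, again with 0.17-era module paths:)

    from scipy.stats import expon
    from sklearn.datasets import load_digits
    from sklearn.feature_selection import RFE
    from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
    from sklearn.svm import SVC

    digits = load_digits()
    X, y = digits.data, digits.target

    # Recursive feature elimination: repeatedly fit a linear SVM and
    # drop the weakest features until 20 remain.
    rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=20, step=2)
    rfe.fit(X, y)
    print("features kept:", rfe.support_.sum())

    # Exhaustive grid search tries every listed value of C ...
    grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10, 100]}, cv=3)
    grid.fit(X, y)

    # ... while randomized search samples C from a distribution, which
    # scales much better as the number of hyperparameters grows.
    rand = RandomizedSearchCV(SVC(), param_distributions={"C": expon(scale=10)},
                              n_iter=4, cv=3, random_state=0)
    rand.fit(X, y)
    print(grid.best_params_, rand.best_params_)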
-----Original Message-----
From: Sebastian Raschka [mailto:se.rasc...@gmail.com]
Sent: Wednesday, September 30, 2015 9:13 PM
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] PyCon 2016 scikit-learn tutorial

If there is interest, I could work on something like "The Ins and Outs of a Machine Learning Pipeline — About the data that you feed to a learning algorithm and how to analyze the results", covering these topics:

Part 1:
- class label encoding
- feature encoding
- feature selection vs. dimensionality reduction for data compression
- supervised vs. unsupervised dimensionality reduction
- working with data that is "too large" (stochastic GD, partial_fit, etc.)
- dealing with class imbalances

Part 2:
- k-fold cross-validation, nested cross-validation
- performance metrics

I think those are important topics one should know about, but would they be interesting/appropriate for a PyCon talk? What do you think?

Best,
Sebastian

> On Sep 30, 2015, at 8:39 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
> I totally agree with Jake. However, I also think that a few general tutorials on "preprocessing" of "clean" datasets (clean in the sense that missing values, duplicates, and outliers have been dealt with) could be useful to a broader, interdisciplinary audience. For example:
>
> - encoding class labels, encoding nominal vs. ordinal feature variables
> - feature scaling and explaining when it matters (convex optimization vs. tree-based algorithms, etc.)
> - partial_fit & dimensionality reduction for data compression if the data is too large for a typical desktop machine and the estimator doesn't support partial_fit; also covering partial_fit of the dimensionality reduction transformers
>
> These are actually very important topics, and I've noticed that they typically fall a little short in general ML tutorials, usually because those tutorials work with a single, specific dataset. Unfortunately, I have seen a couple of applications where nominal string variables were encoded as non-binary integers {1, 2, 3, 4, ...}, which may work (i.e., the code executes without error) but is not the optimal way to do it.
>
> Best,
> Sebastian
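(A quick illustration of Sebastian's last point: map ordinal features with an explicit order, and one-hot encode nominal ones rather than assigning arbitrary integers. A sketch with pandas, using hypothetical toy columns:)

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                       "size":  ["S", "M", "L", "M"]})

    # Ordinal feature: an explicit mapping preserves the order S < M < L.
    df["size"] = df["size"].map({"S": 0, "M": 1, "L": 2})

    # Nominal feature: one-hot encoding avoids imposing a fake ordering
    # like {red: 1, green: 2, blue: 3} on unordered categories.
    df = pd.get_dummies(df, columns=["color"])
    print(df)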
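(As for partial_fit-capable dimensionality reduction, IncrementalPCA, added in scikit-learn 0.16, is the canonical example. A minimal sketch, with random data standing in for chunks streamed from disk:)

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 50)  # stand-in for a dataset too large for memory

    ipca = IncrementalPCA(n_components=10)
    for chunk in np.array_split(X, 10):
        # Only one chunk needs to be in memory at a time.
        ipca.partial_fit(chunk)

    X_reduced = ipca.transform(X[:100])
    print(X_reduced.shape)  # (100, 10)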
>> On Sep 30, 2015, at 7:54 PM, Jacob Vanderplas <jake...@cs.washington.edu> wrote:
>>
>> Hi,
>> The problem with including data munging in the tutorial is that it's not really a machine learning question. Solutions are generally so domain-specific that you can't present them in a way that would be generally useful to an interdisciplinary audience. This is why most (all?) short machine learning tutorials ignore the data cleaning aspect and instead focus on the machine learning algorithms and concepts; in my tutorials, I always try to emphasize that I'm leaving this part up to the user (and perhaps point them to the pandas tutorial, if one is being offered).
>> Jake
>>
>> Jake VanderPlas
>> Senior Data Science Fellow
>> Director of Research in Physical Sciences
>> University of Washington eScience Institute
>>
>> On Wed, Sep 30, 2015 at 4:41 PM, KAB <kha...@yahoo.com> wrote:
>> Hello Jake and Andy,
>>
>> If you would not mind some advice, I would suggest including at least one example that uses data that is not built-in. I remember that the first several tutorials (if not all of them) relied completely on built-in data sets and ignored the elephant in the room: people need to read in their own data and get it into scikit-learn one way or another, typically through pandas or numpy, which then hand the data over to the appropriate scikit-learn routines.
>>
>> Skipping this aspect (and likewise the question of how to deal with categorical data) makes such tutorials, in my humble opinion, an awkward hurdle to getting started with the scikit-learn tool set. I for one had to use R just to overcome these issues when I first started, even though I would have preferred Python and its data science stack, given my experience with and preference for Python over R.
>>
>> Best regards
>>
>> On 9/30/2015 8:22 PM, Andy wrote:
>>> Hi Jake.
>>> I think the tutorial Kyle and I did based on the previous tutorials worked quite well. I think it would make sense to work off our scipy ones and improve them further. I'd be happy to work on it. We have some more exercises in a branch, and I also have improved versions of some of the notebooks that I have been using for teaching.
>>>
>>> Andy
>>>
>>> On 09/29/2015 06:48 PM, Jacob Vanderplas wrote:
>>>> Hi All,
>>>> The PyCon 2016 call for proposals just opened. For the last several years, Olivier and I have been teaching a two-part scikit-learn tutorial at each PyCon, and I think they have gone over well.
>>>>
>>>> As the conference is just a few hours' train ride away for me this year, I'm certainly going to attend again. I'd also love to put together one or more scikit-learn tutorials again this year; if you're planning to attend PyCon and would like to work together on a proposal or two, let me know!
>>>> Jake
>>>>
>>>> Jake VanderPlas
>>>> Senior Data Science Fellow
>>>> Director of Research in Physical Sciences
>>>> University of Washington eScience Institute
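(Coming back to KAB's point earlier in the thread about reading in your own data: the hand-off is usually just pandas or numpy producing arrays that scikit-learn consumes. A hypothetical sketch; the file name, feature columns, and "label" target column are invented for illustration:)

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Tiny stand-in file so the example is self-contained; in practice
    # "my_data.csv" would be your own data with a known target column.
    pd.DataFrame({"f1": [0.1, 2.3, 1.1, 3.2],
                  "f2": [1.0, 0.2, 0.7, 0.9],
                  "label": [0, 1, 0, 1]}).to_csv("my_data.csv", index=False)

    df = pd.read_csv("my_data.csv")

    # pandas handles the parsing; .values hands NumPy arrays to sklearn.
    X = df.drop("label", axis=1).values
    y = df["label"].values

    clf = LogisticRegression()
    clf.fit(X, y)
    print(clf.predict(X[:2]))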