Hello all,

Long-time lurker, first-time emailer.

I have two small contributions I would like to propose to the list. This weekend I was working on a project that used both categorical and numerical columns to predict a final output. I needed to save my transformations for future predictions and to grid search over multiple models and parameters, so sklearn pipelines were the obvious answer. I set up a pipeline, grid searched, then pickled the best model to use for future predictions. This worked well, but I ran into two issues.

*1). I needed a transformer to select individual columns in my pipeline.* I needed to apply unique transformations to each column in my data, then recombine the results with a FeatureUnion. I realized there is no supported transformer for extracting a specific column within a pipeline. See this question as an example <https://stackoverflow.com/questions/39001956/sklearn-pipeline-how-to-apply-different-transformations-on-different-columns?rq=1>. I created a transformer that explicitly extracts the columns of interest for use in a pipeline with FeatureUnion. A FunctionTransformer will solve this issue, but I feel that sklearn should directly and explicitly support this functionality. I believe this would make pipelines significantly more intuitive and accessible for most users.

*2). One-hot encoding requires arrays that are already integers.* You can find a similar issue here <https://stackoverflow.com/questions/40456867/labelbinarizer-for-multiple-columns-in-data-frame>. This can be worked around using pandas.get_dummies() (where the transformation cannot be saved to apply to future predictions) or by using a scikit-learn LabelBinarizer <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html> transformation. However, LabelBinarizer is designed to transform y: its fit and transform methods accept only y, whereas a pipeline passes both X and y to each step. This breaks scikit-learn pipelines. I built a LabelBinarizer transformer that can be used with FeatureUnion in pipelines.
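
For illustration, here is a minimal sketch of how the two transformers described above might look. It is not the actual code from my project: the class names ColumnSelector and PipelineLabelBinarizer are hypothetical (neither exists in scikit-learn), the input is assumed to be a pandas DataFrame, and the sample data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: extract named columns from a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; selection is stateless.
        return self

    def transform(self, X):
        # Return a 2-D array so downstream steps see (n_samples, n_columns).
        return X[self.columns].to_numpy()


class PipelineLabelBinarizer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper that gives LabelBinarizer a fit(X, y) signature."""

    def fit(self, X, y=None):
        # Assumes X holds a single categorical column; y is ignored.
        self.binarizer_ = LabelBinarizer().fit(np.ravel(X))
        return self

    def transform(self, X):
        return self.binarizer_.transform(np.ravel(X))


# Made-up sample data with one numerical and one categorical column.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "color": ["red", "green", "blue"],
})

# Each branch selects its column, transforms it, and FeatureUnion
# concatenates the results side by side.
features = FeatureUnion([
    ("numeric", Pipeline([
        ("select", ColumnSelector(["age"])),
        ("scale", StandardScaler()),
    ])),
    ("categorical", Pipeline([
        ("select", ColumnSelector(["color"])),
        ("binarize", PipelineLabelBinarizer()),
    ])),
])

# Column 0 is the scaled age; columns 1-3 are the indicator columns
# for the sorted classes (blue, green, red).
Xt = features.fit_transform(df)

# The fitted union can be reused on new rows, unlike pandas.get_dummies().
new = pd.DataFrame({"age": [30.0], "color": ["green"]})
Xt_new = features.transform(new)
```

Because both classes follow the usual fit(X, y=None) / transform(X) convention, the FeatureUnion can itself be placed as the first step of a larger Pipeline ahead of an estimator and grid searched as a whole. Note that LabelBinarizer produces a single column when only two classes are present, so the wrapper inherits that behavior.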

The second issue may be moot with the new CategoricalEncoder <http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.CategoricalEncoder.html> that is about to be released.

Does the community believe I should pursue contributing either of these?

--
Cheers,
DJ
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn