[scikit-learn] New Transformer to Support Multiple Column Pipelines & One Hot Encoding

Dale Jacques Tue, 20 Feb 2018 10:14:28 -0800

Hello all,

Long time lurker, first time emailer.

I have two small contributions I would like to propose to the email list.

I was working on a project this weekend that was using both categorical and
numerical columns to predict a final output. I needed to save my
transformations to make future predictions and grid search over multiple
models and parameters, so sklearn pipelines were the obvious answer. I
set up a pipeline, grid searched, then pickled the best model to use for
future predictions.

This worked well, but I ran into two issues.
*1). I needed a transformer to select individual columns in my pipeline. *I
needed to apply unique transformations to each column in my data, then
recombine with a FeatureUnion. I realized there is not a supported
transformer to extract a specific column within pipelines. See this Stack
Overflow question for an example
<https://stackoverflow.com/questions/39001956/sklearn-pipeline-how-to-apply-different-transformations-on-different-columns?rq=1>.
I created a transformation that explicitly extracts columns of interest for
use in a pipeline with FeatureUnion. A FunctionTransformer will solve this
issue, but I feel as if sklearn should directly and explicitly support this
functionality. I believe this will make pipelines significantly more
intuitive and accessible for most users.
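
To make the idea in 1) concrete, here is a minimal sketch of what such a
column-selecting transformer could look like. It is not the implementation
described in this email: the class name ColumnSelector, the example column
names, and the choice of StandardScaler are all assumptions made for
illustration, and the input is assumed to be a pandas DataFrame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that extracts named DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns  # list of column names to keep

    def fit(self, X, y=None):
        return self  # nothing to learn; selection is stateless

    def transform(self, X):
        # Indexing with a list keeps the result two-dimensional.
        return X[self.columns].to_numpy()


# Toy data (made up for this example).
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1.0, 2.0, 3.0],
    "city": ["a", "b", "a"],
})

# Each branch selects its own column(s) and applies its own transformation;
# FeatureUnion stacks the branch outputs side by side.
union = FeatureUnion([
    ("scaled_age", Pipeline([
        ("select", ColumnSelector(["age"])),
        ("scale", StandardScaler()),
    ])),
    ("raw_income", ColumnSelector(["income"])),
])

features = union.fit_transform(df)
```

Because the selector is an ordinary estimator, the whole union can be
pickled and grid searched like any other pipeline step.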

*2). One hot encoding requires arrays that are already integers.* You can
find a similar issue here
<https://stackoverflow.com/questions/40456867/labelbinarizer-for-multiple-columns-in-data-frame>.
This can be accomplished using pandas.get_dummies() (where the
transformation cannot be saved to apply to future predictions) or by using
a scikit-learn LabelBinarizer
<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html>
transformation. LabelBinarizer is designed to transform y: its fit and
fit_transform methods accept only y, whereas a pipeline calls them with
both X and y. This breaks scikit-learn pipelines. I built a LabelBinarizer
transformation that can be used with
FeatureUnion in pipelines. This issue may be moot with the new
CategoricalEncoder
<http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.CategoricalEncoder.html>
that is about to be released.
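
One way the wrapper in 2) might be written is sketched below. This is an
illustrative guess, not the code described in this email: the class name
PipelineLabelBinarizer is hypothetical, and the sketch assumes each instance
receives a single categorical column (for example from a column selector
inside a FeatureUnion).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer


class PipelineLabelBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper exposing LabelBinarizer as fit(X, y)/transform(X)."""

    def fit(self, X, y=None):
        # Flatten the single input column and fit the wrapped binarizer on it;
        # y is accepted only to match the pipeline calling convention.
        self.binarizer_ = LabelBinarizer().fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        return self.binarizer_.transform(np.asarray(X).ravel())


# Toy single-column input (made up for this example).
train = np.array([["red"], ["green"], ["blue"]])
encoder = PipelineLabelBinarizer().fit(train)

# The fitted encoder can be pickled and reused on future rows.
encoded = encoder.transform(np.array([["green"], ["red"]]))
```

Note that LabelBinarizer returns a single column when only two classes are
present, so the output width is not always equal to the number of categories.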

Does the community believe I should pursue contributing either of these?

--
Cheers,

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
