Hi Jason, like Andreas said, you really have to be careful with categorical features. The one-hot encoder is really meant for nominal features, though; I would handle ordinal ones differently:
E.g., if you have "sizes" like "M", "L", "S", "XL", I would encode them as integers that preserve the order:

    ["M", "L", "S", "XL"] -> [2, 3, 1, 4]

And e.g., colors via the one-hot encoder, since they have no natural order:

    ["green", "red", "blue"] -> [[0, 0, 1], [0, 1, 0], [1, 0, 0]]

About the column selector: if it is useful, you can just make your own class

    class ColumnSelector(object):
        def __init__(self, cols):
            self.cols = cols

        def transform(self, X, y=None):
            return X[:, self.cols]

        def fit(self, X, y=None):
            return self

and then e.g., use it in a pipeline or so:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    clf = Pipeline(steps=[
        ('scl', StandardScaler()),
        ('sele', ColumnSelector(cols=(1, 3))),  # extracts the 2nd and 4th columns
        ('clf', LogisticRegression())
    ])

Best,
Sebastian

> On Mar 2, 2015, at 1:31 PM, Andy <t3k...@gmail.com> wrote:
>
> Hi Jason.
>
> We don't have any support for groups or types of features currently, sorry.
> And you do need to convert all categorical features to one-hot encoded
> features for use with sklearn.
> The underlying issue is that we use numpy arrays as our main data structure,
> and they are not very easy to annotate with feature types etc.
>
> Best,
> Andreas
>
> On 03/02/2015 01:08 PM, Jason Wolosonovich wrote:
>> Hello All,
>>
>> When using any of the preprocessing options in sklearn, is it possible to
>> select a subset of features (columns) in a dataset for preprocessing? Many
>> datasets contain a mix of feature types (categorical, numerical, binary) and
>> it doesn’t seem like it would make sense to scale certain types of features
>> (like binary and categorical), though I suppose if the information contained
>> in them is not altered by the scaling, it may not hurt to have it scale the
>> entire dataset regardless of feature type. Any thoughts on the subject
>> welcome. Thanks!
>>
>> -Jason
>>
>> ------------------------------------------------------------------------------
>> Dive into the World of Parallel Programming The Go Parallel Website, sponsored
>> by Intel and developed in partnership with Slashdot Media, is your hub for all
>> things parallel software development, from weekly thought leadership blogs to
>> news, videos, case studies, tutorials and more. Take a look and join the
>> conversation now. http://goparallel.sourceforge.net/
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
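P.S. For anyone who wants to try the two encodings and the column-selection idea above without pulling in a full pipeline, here is a minimal plain-Python sketch (no numpy, so the selector uses list indexing instead of `X[:, self.cols]` fancy indexing; the one-hot column order here is alphabetical, which is just one arbitrary choice):

```python
# Ordinal feature: map sizes to integers that preserve their order.
size_mapping = {"S": 1, "M": 2, "L": 3, "XL": 4}
sizes = ["M", "L", "S", "XL"]
encoded_sizes = [size_mapping[s] for s in sizes]  # -> [2, 3, 1, 4]

# Nominal feature: one-hot encode colors; the column order is a free
# choice, here alphabetical: blue, green, red.
colors = ["green", "red", "blue"]
categories = sorted(set(colors))
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Column selection, mirroring the ColumnSelector idea with plain lists
# of lists instead of a numpy array.
class ColumnSelector(object):
    """Keeps only the columns whose indices are listed in `cols`."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the fit/transform API.
        return self

    def transform(self, X, y=None):
        return [[row[i] for i in self.cols] for row in X]

X = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]  # 3 samples, 4 features
X_sub = ColumnSelector(cols=(1, 3)).fit(X).transform(X)
# X_sub holds the 2nd and 4th columns: [[1, 3], [5, 7], [9, 11]]
```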