Hi Jason,

As Andreas said, you really have to be careful with categorical features. The
one-hot encoder is meant for nominal features, though; I would handle ordinal
ones differently:

E.g., if you have "sizes" like "M", "L", "S", "XL", I would encode them as

["M", "L", "S", "XL"] -> [2, 3, 1, 4]

And colors, for example, via the one-hot encoder:

["green", "red", "blue"] -> [[0,0,1], [0,1,0], [1,0,0]]


About the column selector: if that would be useful, you can just write your own class:


class ColumnSelector(object):
    """Selects a subset of columns from a 2D NumPy array; pipeline-compatible."""

    def __init__(self, cols):
        self.cols = cols

    def transform(self, X, y=None):
        # return only the requested columns
        return X[:, self.cols]

    def fit(self, X, y=None):
        # stateless transformer: nothing to learn
        return self

and then use it in a pipeline, for example:

clf = Pipeline(steps=[
    ('scl', StandardScaler()),
    ('sele', ColumnSelector(cols=(1, 3))),   # extracts columns 2 and 4
    ('clf', LogisticRegression()),
])
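For completeness, here is an end-to-end sketch on made-up toy data (the
dataset and labels are invented for illustration; ColumnSelector is repeated
so the snippet runs on its own):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


class ColumnSelector(object):
    """Selects a subset of columns from a 2D NumPy array; pipeline-compatible."""

    def __init__(self, cols):
        self.cols = cols

    def transform(self, X, y=None):
        return X[:, self.cols]

    def fit(self, X, y=None):
        return self


# toy data: 20 samples, 4 features (purely synthetic)
rng = np.random.RandomState(0)
X = rng.randn(20, 4)
y = (X[:, 1] + X[:, 3] > 0).astype(int)

clf = Pipeline(steps=[
    ('scl', StandardScaler()),
    ('sele', ColumnSelector(cols=(1, 3))),   # keep columns 2 and 4
    ('clf', LogisticRegression()),
])
clf.fit(X, y)
print(clf.predict(X[:3]))
```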


Best,
Sebastian




> On Mar 2, 2015, at 1:31 PM, Andy <t3k...@gmail.com> wrote:
> 
> Hi Jason.
> 
> We don't have any support for groups or types of features currently, sorry.
> And you do need to convert all categorical features to one-hot encoded 
> features for use with sklearn.
> The underlying issue is that we use numpy arrays as our main data structure, 
> and they are not very easy to annotate with feature types etc.
> 
> Best,
> Andreas
> 
> On 03/02/2015 01:08 PM, Jason Wolosonovich wrote:
>> Hello All,
>>  
>> When using any of the preprocessing options in sklearn, is it possible to 
>> select a subset of features (columns) in a dataset for preprocessing? Many 
>> datasets contain a mix of feature types (categorical, numerical, binary) and 
>> it doesn’t seem like it would make sense to scale certain types of features 
>> (like binary and categorical), though I suppose if the information contained 
>> in them is not altered by the scaling, it may not hurt to have it scale the 
>> entire dataset regardless of feature type.  Any thoughts on the subject 
>> welcome. Thanks!
>>  
>>  
>> -Jason
>> 
>> 
>> ------------------------------------------------------------------------------
>> Dive into the World of Parallel Programming The Go Parallel Website, 
>> sponsored
>> by Intel and developed in partnership with Slashdot Media, is your hub for 
>> all
>> things parallel software development, from weekly thought leadership blogs to
>> news, videos, case studies, tutorials and more. Take a look and join the 
>> conversation now. 
>> http://goparallel.sourceforge.net/
>> 
>> 
>> _______________________________________________
>> Scikit-learn-general mailing list
>> 
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 


