Well, after a bit of tinkering it seems that OneHotEncoder follows simple rules
to assign columns in the output:
1) first the categorical features, in the order given by the argument, creating
as many columns as each feature has distinct values
2) then the numerical ones, copied as-is

So a piece of code like this seems to work:

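>>> fn = []  # output column names, categorical features first
>>> fc = []  # numerical column names, appended at the end
>>> for c in df.columns.values:
...     if is_categorical(c):
...         # sort the values themselves, not the formatted strings,
...         # so that e.g. 10 comes after 2
...         fn += ['%s=%s' % (c, v) for v in sorted(df[c].unique())]
...     else:
...         fc += [c]
...
>>> fn += fc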

assuming the original data is in a pandas DataFrame (df) and you have a
function (is_categorical) that tells whether a given column name is
categorical.

Of course this is pretty much reverse-engineering the OHE, so it may break in
the future.
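
For what it's worth, here is a pandas-free sketch of the same two rules that
also records which output columns each input column ends up in (1->N for the
categorical ones, 1->1 for the numerical ones). It only mirrors the behaviour
observed above, not anything the OHE documents or guarantees. The helper name
(encoded_column_names), its arguments and the sample data are all made up for
illustration. Unlike the loop above, it walks the categorical columns in the
order they are passed in rather than in DataFrame order.

```python
def encoded_column_names(col_names, rows, categorical):
    """Guess the output column names under the two rules observed above.

    col_names:   input column names, in input order
    rows:        the data as a list of rows (sequences of values)
    categorical: names of the categorical columns, in the order they
                 are handed to the encoder (hypothetical argument)
    """
    index = {name: i for i, name in enumerate(col_names)}
    names = []
    spans = {}  # input column name -> (start, stop) range in the output

    # Rule 1: categorical columns first, one output column per distinct value.
    for name in categorical:
        values = sorted({row[index[name]] for row in rows})
        start = len(names)
        names += ['%s=%s' % (name, v) for v in values]
        spans[name] = (start, len(names))

    # Rule 2: the remaining (numerical) columns are copied, in input order.
    for name in col_names:
        if name not in spans:
            spans[name] = (len(names), len(names) + 1)
            names.append(name)

    return names, spans


# Made-up sample: two id columns with a numerical column in between.
col_names = ['user_id', 'price', 'country_id']
rows = [
    [10, 1.5, 2],
    [2, 3.0, 0],
    [10, 0.5, 2],
]
names, spans = encoded_column_names(col_names, rows, ['user_id', 'country_id'])
print(names)
print(spans)
```

With this sample, 'price' ends up last even though it sits between the two id
columns in the input, which is exactly the intertwining issue.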

E/


2015-03-06 9:01 GMT+01:00 Eustache DIEMERT <eusta...@diemert.fr>:

>
> 2015-03-05 16:57 GMT+01:00 Andy <t3k...@gmail.com>:
>
>>  Well, the columns after the OneHotEncoder correspond to feature values,
>> not feature names, right?
>>
>
> Well, for the categorical ones this is right, except that not all my
> features are categorical (hence the categorical_features=...) and they
> are intertwined.
>
> So my problem is more to keep track of which categorical features got
> projected into which columns (1->N) and which numerical ones have been just
> copied and where (1->1).
>
> Re-reading your answer, I'm wondering whether you are suggesting to just
> separate the input columns by feature type and apply the encoder to the
> categorical ones only?
>
>
>
>> There is ``feature_indices_`` which maps each feature to a range of
>> features in the encoded matrix.
>> The features in the input matrix don't really have names in scikit-learn,
>> as they are represented only as numpy matrices.
>> So you need to keep track of the indices of each feature. That shouldn't
>> be too hard, though.
>>
>> Why don't you select the features before the encoding? Or do you want to
>> exclude some values?
>>
>>
>>
>> On 03/05/2015 05:55 AM, Eustache DIEMERT wrote:
>>
>> Hi list,
>>
>> I have an X (np.array) with some columns containing ids. I also have a
>> list of column names. I want to transform the relevant columns with
>> OneHotEncoder so they can be used by a logistic regression model:
>>
>> >>> X = np.loadtxt(...) # from a CSV
>> >>> col_names = ... # from CSV header
>> >>> e = OneHotEncoder(categorical_features=id_columns)
>> >>> Xprime = e.fit_transform(X)
>>
>>  But then I don't know how to deduce the names of the columns in the new
>> matrix :(
>>
>>  Ideally I would want the same as DictVectorizer which has a
>> feature_names_ member.
>>
>> Has anyone already had this problem?
>>
>>  Eustache
>>
>>
>> ------------------------------------------------------------------------------
>> Dive into the World of Parallel Programming The Go Parallel Website, sponsored
>> by Intel and developed in partnership with Slashdot Media, is your hub for all
>> things parallel software development, from weekly thought leadership blogs to
>> news, videos, case studies, tutorials and more. Take a look and join the
>> conversation now. http://goparallel.sourceforge.net/
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>>
>>
>>
>
