Re: [scikit-learn] behaviour of OneHotEncoder somewhat confusing

Sebastian Raschka Mon, 19 Sep 2016 15:10:18 -0700

Hi, Lee,

maybe set `n_values=4`; this seems to do the job. I think the problem you 
encountered is due to the fact that the one-hot encoder infers the number of 
values for each feature (column) from the dataset. In your example, each 
column contained only 1 unique value:


> array([[0, 1, 2, 3],
>        [0, 1, 2, 3],
>        [0, 1, 2, 3]])

If you had an array like

> array([[0],
>        [1],
>        [2],
>        [3]])

it should work though. Alternatively, set n_values to 4:


> >>> from sklearn.preprocessing import OneHotEncoder
> >>> import numpy as np
> 
> >>> enc = OneHotEncoder(n_values=4)
> >>> X = np.array([[0, 1, 2, 3]])
> >>> enc.fit_transform(X).toarray()


array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.]])

and 

> X2 = np.array([[0, 1, 2, 3],
>                [0, 1, 2, 3],
>                [0, 1, 2, 3]])
> 
> enc.transform(X2).toarray()



array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  1.]])
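
To make the difference between the two cases concrete, here is a rough plain-Python sketch of the behaviour described above. It is not scikit-learn's actual implementation; `one_hot_rows` is a made-up helper name, and its `n_values` argument only mimics the encoder's parameter of the same name. The assumption is that, without `n_values`, each column only gets an output slot for the values actually observed in that column.

```python
def one_hot_rows(rows, n_values=None):
    """One-hot encode each column of integer-coded rows independently.

    Hypothetical helper for illustration only, not part of scikit-learn.
    """
    n_cols = len(rows[0])
    if n_values is None:
        # Infer the categories of each column from the values seen in the data.
        categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    else:
        # Assume every column can take the values 0 .. n_values - 1.
        categories = [list(range(n_values)) for _ in range(n_cols)]

    encoded = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            # One output slot per category of column j.
            out.extend(1 if value == c else 0 for c in categories[j])
        encoded.append(out)
    return encoded


# Three identical rows: every column holds a single distinct value.
X = [[0, 1, 2, 3]] * 3
inferred = one_hot_rows(X)           # one slot per column, always "on"
fixed = one_hot_rows(X, n_values=4)  # four slots per column

# A single column holding all four values.
column = [[0], [1], [2], [3]]
per_position = one_hot_rows(column)

print(inferred)      # [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
print(fixed[0])      # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(per_position)  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

With inferred categories, the three-row array collapses to a row of ones, because each column has only one value to encode. Fixing the number of values, or feeding the labels in as a single column, gives the positional encoding.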


Best,
Sebastian


> On Sep 19, 2016, at 5:45 PM, Lee Zamparo <[email protected]> wrote:
> 
> Hi sklearners,
> 
> A lab-mate came to me with a problem about encoding DNA sequences using 
> preprocessing.OneHotEncoder, and I find it to produce confusing results.
> 
> Suppose I have a DNA string:  myguide = 'ACGT'
> 
> He’d like to use OneHotEncoder to transform DNA strings, character by character, 
> into a one hot encoded representation like this: [[1,0,0,0], [0,1,0,0], 
> [0,0,1,0], [0,0,0,1]].  The use-case seems to be solved in pandas using the 
> dubiously named get_dummies method 
> (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
>   I thought that it would be trivial to do with OneHotEncoder, but it seems 
> strangely difficult:
> 
> In [23]: myarray = le.fit_transform([c for c in myguide])
> 
> In [24]: myarray
> Out[24]: array([0, 1, 2, 3])
> 
> In [27]: myarray = le.transform([[c for c in myguide],[c for c in myguide],[c 
> for c in myguide]])
> 
> In [28]: myarray
> Out[28]:
> array([[0, 1, 2, 3],
>        [0, 1, 2, 3],
>        [0, 1, 2, 3]])
> 
> In [29]: ohe.fit_transform(myarray)
> Out[29]:
> array([[ 1.,  1.,  1.,  1.],
>        [ 1.,  1.,  1.,  1.],
>        [ 1.,  1.,  1.,  1.]])    <— ????
> 
> So this is not at all what I expected.  I read the documentation for 
> OneHotEncoder 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
>  but did not find it clear how it worked (also I found the example using 
> integers confusing).  Neither FeatureHasher nor DictVectorizer seem to be 
> more appropriate for transforming strings into positional OneHot encoded 
> arrays.  Am I missing something, or is this operation not supported in 
> sklearn?
> 
> Thanks,
> 
> -- 
> Lee Zamparo
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
