OneHotEncoder has issues, but I think all you want here is ohe.fit_transform(le.fit_transform([c for c in myguide]).reshape(-1, 1)): the label-encoded sequence needs to be reshaped into a column vector (one character per row), since np.transpose is a no-op on a 1-D array.
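For reference, the encoding that one-liner aims at can be sketched without scikit-learn at all; the helper below is illustrative, not sklearn API. Each character becomes one sample (row), and each of the four bases gets its own column:

```python
# Minimal sketch of the desired DNA one-hot encoding, no sklearn needed.
bases = sorted(set('ACGT'))                  # ['A', 'C', 'G', 'T']
index = {b: i for i, b in enumerate(bases)}  # what LabelEncoder's fit learns

def one_hot(seq):
    # one row per character, one column per base
    return [[1.0 if index[c] == j else 0.0 for j in range(len(bases))]
            for c in seq]

print(one_hot('ACGT'))
# [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
```

Feeding OneHotEncoder a column vector of label-encoded integers (shape (4, 1), one character per row) should reproduce exactly this matrix.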
Still, this seems far from the intended use of OneHotEncoder (which should not really be stacked with LabelEncoder), so it's not surprising that it's tricky.

On 20 September 2016 at 08:07, Sebastian Raschka <se.rasc...@gmail.com> wrote:

> Hi, Lee,
>
> Maybe set `n_values=4`; this seems to do the job. I think the problem you
> encountered is due to the fact that the one-hot encoder infers the number
> of values for each feature (column) from the dataset. In your case, each
> column had only 1 unique value in your example:
>
>     array([[0, 1, 2, 3],
>            [0, 1, 2, 3],
>            [0, 1, 2, 3]])
>
> If you had an array like
>
>     array([[0],
>            [1],
>            [2],
>            [3]])
>
> it should work, though. Alternatively, set n_values to 4:
>
>     >>> from sklearn.preprocessing import OneHotEncoder
>     >>> import numpy as np
>     >>> enc = OneHotEncoder(n_values=4)
>     >>> X = np.array([[0, 1, 2, 3]])
>     >>> enc.fit_transform(X).toarray()
>     array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
>
> and
>
>     >>> X2 = np.array([[0, 1, 2, 3],
>     ...                [0, 1, 2, 3],
>     ...                [0, 1, 2, 3]])
>     >>> enc.transform(X2).toarray()
>     array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.],
>            [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.],
>            [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
>
> Best,
> Sebastian
>
> > On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamp...@gmail.com> wrote:
> >
> > Hi sklearners,
> >
> > A lab-mate came to me with a problem about encoding DNA sequences using
> > preprocessing.OneHotEncoder, and I find it produces confusing results.
> >
> > Suppose I have a DNA string: myguide = 'ACGT'
> >
> > He'd like to use OneHotEncoder to transform DNA strings, character by
> > character, into a one-hot encoded representation like this: [[1,0,0,0],
> > [0,1,0,0], [0,0,1,0], [0,0,0,1]]. The use case seems to be solved in
> > pandas by the dubiously named get_dummies method
> > (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
> > I thought that it would be trivial to do with OneHotEncoder, but it seems
> > strangely difficult:
> >
> > In [23]: myarray = le.fit_transform([c for c in myguide])
> >
> > In [24]: myarray
> > Out[24]: array([0, 1, 2, 3])
> >
> > In [27]: myarray = le.transform([[c for c in myguide],
> >                                  [c for c in myguide],
> >                                  [c for c in myguide]])
> >
> > In [28]: myarray
> > Out[28]:
> > array([[0, 1, 2, 3],
> >        [0, 1, 2, 3],
> >        [0, 1, 2, 3]])
> >
> > In [29]: ohe.fit_transform(myarray)
> > Out[29]:
> > array([[ 1.,  1.,  1.,  1.],
> >        [ 1.,  1.,  1.,  1.],
> >        [ 1.,  1.,  1.,  1.]])  <-- ????
> >
> > So this is not at all what I expected. I read the documentation for
> > OneHotEncoder
> > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)
> > but did not find it clear how it worked (I also found the example using
> > integers confusing). Neither FeatureHasher nor DictVectorizer seems more
> > appropriate for transforming strings into positional one-hot encoded
> > arrays. Am I missing something, or is this operation not supported in
> > sklearn?
> >
> > Thanks,
> >
> > --
> > Lee Zamparo
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
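A footnote on the all-ones result above: with the pre-0.20 `n_values='auto'` behaviour, fitting in effect allocates one output column per distinct value seen in each feature column. Each of the four columns in Lee's 3x4 array contains only one distinct value, so each contributes a single always-on output column. A plain-Python sketch of that bookkeeping (illustrative only, not the actual sklearn internals):

```python
# Each row is a sample; each of the 4 columns is treated as a feature.
X = [[0, 1, 2, 3],
     [0, 1, 2, 3],
     [0, 1, 2, 3]]

# Count distinct values per feature column, as the old encoder's fit
# in effect does when deciding the output width.
uniques = [len({row[j] for row in X}) for j in range(len(X[0]))]
print(uniques)  # [1, 1, 1, 1]

# One output column per distinct value per feature -> 4 columns total,
# and every sample activates all of them, hence the all-ones matrix.
n_output_columns = sum(uniques)
print(n_output_columns)  # 4
```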