OneHotEncoder has issues, but I think all you want here is
ohe.fit_transform(le.fit_transform([c for c in myguide]).reshape(-1, 1))
Still, this seems like it is far from the intended use of
OneHotEncoder (which should not really be stacked with LabelEncoder),
so it's not surprising it's tricky.
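For what it's worth, here is a minimal sketch of the single-column layout, assuming scikit-learn 0.20 or later, where OneHotEncoder takes string input directly and accepts a `categories` argument (neither was available when this thread was written, which is why LabelEncoder was being stacked in front). The names `X` and `encoded` are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

myguide = "ACGT"

# One row per base and a single feature column: shape (4, 1).
X = np.array(list(myguide)).reshape(-1, 1)

# Spell out the alphabet so that a sequence missing one of the bases
# still yields four output columns (assumes scikit-learn >= 0.20).
ohe = OneHotEncoder(categories=[["A", "C", "G", "T"]])

# The default output is a sparse matrix; densify it for display.
encoded = ohe.fit_transform(X).toarray()
print(encoded)
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
```

The point is the shape: each character is a sample (row) of one categorical feature, not four features of one sample.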
On 20 September 2016 at 08:07, Sebastian Raschka
<se.rasc...@gmail.com> wrote:
Hi, Lee,
maybe set `n_values=4`; this seems to do the job. I think the
problem you encountered is due to the fact that the one-hot
encoder infers the number of values for each feature (column)
from the dataset. In your example, each column had only 1 unique
value:
> array([[0, 1, 2, 3],
> [0, 1, 2, 3],
> [0, 1, 2, 3]])
If you had an array like
> array([[0],
> [1],
> [2],
> [3]])
it should work though. Alternatively, set n_values to 4:
> >>> from sklearn.preprocessing import OneHotEncoder
> >>> import numpy as np
>
> >>> enc = OneHotEncoder(n_values=4)
> >>> X = np.array([[0, 1, 2, 3]])
> >>> enc.fit_transform(X).toarray()
array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])
and
> X2 = np.array([[0, 1, 2, 3],
> [0, 1, 2, 3],
> [0, 1, 2, 3]])
>
> enc.transform(X2).toarray()
array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
       [ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
       [ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])
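The per-column inference described above can be illustrated with plain NumPy. This is only a rough sketch of the idea, not scikit-learn's actual code; `inferred_widths` and `one_hot_fixed` are hypothetical helper names made up for this example.

```python
import numpy as np

def inferred_widths(X):
    """Count the distinct values seen in each column of X."""
    return [len(np.unique(X[:, j])) for j in range(X.shape[1])]

def one_hot_fixed(X, n_values):
    """One-hot encode every column of X into exactly n_values slots."""
    eye = np.eye(n_values)
    # eye[X] has shape (rows, cols, n_values); merge the last two axes
    # so each row holds the per-column encodings side by side.
    return eye[X].reshape(X.shape[0], -1)

X_rows = np.array([[0, 1, 2, 3]] * 3)    # 3 samples, 4 features
X_col = np.array([[0], [1], [2], [3]])   # 4 samples, 1 feature

print(inferred_widths(X_rows))           # [1, 1, 1, 1]
print(inferred_widths(X_col))            # [4]
print(one_hot_fixed(X_rows, 4).shape)    # (3, 16)
```

With one distinct value per column, each feature gets a single output column that is always 1, which is where the all-ones array comes from. Fixing the width at 4 per column gives the 16-column rows shown above.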
Best,
Sebastian
> On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamp...@gmail.com> wrote:
>
> Hi sklearners,
>
> A lab-mate came to me with a problem about encoding DNA
> sequences using preprocessing.OneHotEncoder, and I find that it
> produces confusing results.
>
> Suppose I have a DNA string: myguide = 'ACGT'
>
> He'd like to use OneHotEncoder to transform DNA strings, character
> by character, into a one-hot encoded representation like this:
> [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]]. The use case seems
> to be solved in pandas using the dubiously named get_dummies
> method
> (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
> I thought that it would be trivial to do with OneHotEncoder, but
> it seems strangely difficult:
>
> In [23]: myarray = le.fit_transform([c for c in myguide])
>
> In [24]: myarray
> Out[24]: array([0, 1, 2, 3])
>
> In [27]: myarray = le.transform([[c for c in myguide],[c for c in myguide],[c for c in myguide]])
>
> In [28]: myarray
> Out[28]:
> array([[0, 1, 2, 3],
> [0, 1, 2, 3],
> [0, 1, 2, 3]])
>
> In [29]: ohe.fit_transform(myarray)
> Out[29]:
> array([[ 1., 1., 1., 1.],
> [ 1., 1., 1., 1.],
> [ 1., 1., 1., 1.]]) <— ????
>
> So this is not at all what I expected. I read the
> documentation for OneHotEncoder
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
> but did not find it clear how it worked (I also found the example
> using integers confusing). Neither FeatureHasher nor
> DictVectorizer seems more appropriate for transforming
> strings into positional one-hot encoded arrays. Am I missing
> something, or is this operation not supported in sklearn?
>
> Thanks,
>
> --
> Lee Zamparo
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn