OneHotEncoder has issues, but I think all you want here is ohe.fit_transform(le.fit_transform([c for c in myguide]).reshape(-1, 1)): the label-encoded sequence needs to be reshaped into a column vector (one character per row), since np.transpose is a no-op on a 1-D array.
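For reference, the encoding that one-liner aims at can be sketched without scikit-learn at all; the helper below is illustrative, not sklearn API. Each character becomes one sample (row), and each of the four bases gets its own column:

```python
# Minimal sketch of the desired DNA one-hot encoding, no sklearn needed.
bases = sorted(set('ACGT'))                  # ['A', 'C', 'G', 'T']
index = {b: i for i, b in enumerate(bases)}  # what LabelEncoder's fit learns

def one_hot(seq):
    # one row per character, one column per base
    return [[1.0 if index[c] == j else 0.0 for j in range(len(bases))]
            for c in seq]

print(one_hot('ACGT'))
# [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
```

Feeding OneHotEncoder a column vector of label-encoded integers (shape (4, 1), one character per row) should reproduce exactly this matrix.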
Still, this seems far from the intended use of OneHotEncoder (which should not really be stacked with LabelEncoder), so it's not surprising that it's tricky.

On 20 September 2016 at 08:07, Sebastian Raschka <se.rasc...@gmail.com> wrote:

> Hi, Lee,
>
> Maybe set `n_values=4`; this seems to do the job. I think the problem you
> encountered is due to the fact that the one-hot encoder infers the number
> of values for each feature (column) from the dataset. In your case, each
> column had only 1 unique value in your example:
>
>     array([[0, 1, 2, 3],
>            [0, 1, 2, 3],
>            [0, 1, 2, 3]])
>
> If you had an array like
>
>     array([[0],
>            [1],
>            [2],
>            [3]])
>
> it should work, though. Alternatively, set n_values to 4:
>
>     >>> from sklearn.preprocessing import OneHotEncoder
>     >>> import numpy as np
>     >>> enc = OneHotEncoder(n_values=4)
>     >>> X = np.array([[0, 1, 2, 3]])
>     >>> enc.fit_transform(X).toarray()
>     array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
>
> and
>
>     >>> X2 = np.array([[0, 1, 2, 3],
>     ...                [0, 1, 2, 3],
>     ...                [0, 1, 2, 3]])
>     >>> enc.transform(X2).toarray()
>     array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.],
>            [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.],
>            [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
>
> Best,
> Sebastian
>
> > On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamp...@gmail.com> wrote:
> >
> > Hi sklearners,
> >
> > A lab-mate came to me with a problem about encoding DNA sequences using
> > preprocessing.OneHotEncoder, and I find it produces confusing results.
> >
> > Suppose I have a DNA string: myguide = 'ACGT'
> >
> > He'd like to use OneHotEncoder to transform DNA strings, character by
> > character, into a one-hot encoded representation like this: [[1,0,0,0],
> > [0,1,0,0], [0,0,1,0], [0,0,0,1]]. The use case seems to be solved in
> > pandas by the dubiously named get_dummies method
> > (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
> > I thought that it would be trivial to do with OneHotEncoder, but it seems
> > strangely difficult:
> >
> > In [23]: myarray = le.fit_transform([c for c in myguide])
> >
> > In [24]: myarray
> > Out[24]: array([0, 1, 2, 3])
> >
> > In [27]: myarray = le.transform([[c for c in myguide],
> >                                  [c for c in myguide],
> >                                  [c for c in myguide]])
> >
> > In [28]: myarray
> > Out[28]:
> > array([[0, 1, 2, 3],
> >        [0, 1, 2, 3],
> >        [0, 1, 2, 3]])
> >
> > In [29]: ohe.fit_transform(myarray)
> > Out[29]:
> > array([[ 1.,  1.,  1.,  1.],
> >        [ 1.,  1.,  1.,  1.],
> >        [ 1.,  1.,  1.,  1.]])  <-- ????
> >
> > So this is not at all what I expected. I read the documentation for
> > OneHotEncoder
> > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)
> > but did not find it clear how it worked (I also found the example using
> > integers confusing). Neither FeatureHasher nor DictVectorizer seems more
> > appropriate for transforming strings into positional one-hot encoded
> > arrays. Am I missing something, or is this operation not supported in
> > sklearn?
> >
> > Thanks,
> >
> > --
> > Lee Zamparo
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
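A footnote on the all-ones result above: with the pre-0.20 `n_values='auto'` behaviour, fitting in effect allocates one output column per distinct value seen in each feature column. Each of the four columns in Lee's 3x4 array contains only one distinct value, so each contributes a single always-on output column. A plain-Python sketch of that bookkeeping (illustrative only, not the actual sklearn internals):

```python
# Each row is a sample; each of the 4 columns is treated as a feature.
X = [[0, 1, 2, 3],
     [0, 1, 2, 3],
     [0, 1, 2, 3]]

# Count distinct values per feature column, as the old encoder's fit
# in effect does when deciding the output width.
uniques = [len({row[j] for row in X}) for j in range(len(X[0]))]
print(uniques)  # [1, 1, 1, 1]

# One output column per distinct value per feature -> 4 columns total,
# and every sample activates all of them, hence the all-ones matrix.
n_output_columns = sum(uniques)
print(n_output_columns)  # 4
```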