Hi Sebastian,

Great, thanks!
The docstring doesn't make it very clear that the default `n_values='auto'` infers the number of distinct values column-wise; maybe I could do a quick PR to update it? Or maybe I could turn your example into a, well, example for the online documentation? Alternatively, if you think this use case is too far outside OneHotEncoder's intended usage, maybe doing nothing is the best course?

Thanks,

--
Lee Zamparo

On September 19, 2016 at 6:08:15 PM, Sebastian Raschka (se.rasc...@gmail.com) wrote:

Hi, Lee,

maybe set `n_values=4`; this seems to do the job. I think the problem you encountered is due to the fact that the one-hot encoder infers the number of values for each feature (column) from the dataset. In your case, each column had only 1 unique value in your example

> array([[0, 1, 2, 3],
>        [0, 1, 2, 3],
>        [0, 1, 2, 3]])

If you had an array like

> array([[0],
>        [1],
>        [2],
>        [3]])

it should work, though. Alternatively, set n_values to 4:

> >>> from sklearn.preprocessing import OneHotEncoder
> >>> import numpy as np
>
> >>> enc = OneHotEncoder(n_values=4)
> >>> X = np.array([[0, 1, 2, 3]])
> >>> enc.fit_transform(X).toarray()
> array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])

and

> >>> X2 = np.array([[0, 1, 2, 3],
> ...                [0, 1, 2, 3],
> ...                [0, 1, 2, 3]])
> >>> enc.transform(X2).toarray()
> array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
>        [ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
>        [ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])

Best,
Sebastian

> On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamp...@gmail.com> wrote:
>
> Hi sklearners,
>
> A lab-mate came to me with a problem about encoding DNA sequences using preprocessing.OneHotEncoder, and I find it produces confusing results.
>
> Suppose I have a DNA string: myguide = 'ACGT'
>
> He'd like to use OneHotEncoder to transform DNA strings, character by character, into a one-hot encoded representation like this: [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]].
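[Editor's note: Sebastian's reshaping suggestion can be sketched with a later scikit-learn API as well; `n_values` was deprecated and removed in newer versions, but the default behavior there is still to infer the categories per column, which is exactly the behavior discussed above. With one value per row, the single column contains all four values, so inference succeeds:]

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# All four values appear in the single column, so the encoder
# can infer the full set of categories from the data itself.
X = np.array([[0], [1], [2], [3]])
enc = OneHotEncoder()
result = enc.fit_transform(X).toarray()
print(result)  # 4x4 identity matrix: one indicator column per value
```

Compare this with the 3x4 array above, where every column holds a single repeated value and the encoder therefore infers only one category per column.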
> The use-case seems to be solved in pandas by the dubiously named get_dummies method (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html). I thought it would be trivial to do with OneHotEncoder, but it seems strangely difficult:
>
> In [23]: myarray = le.fit_transform([c for c in myguide])
>
> In [24]: myarray
> Out[24]: array([0, 1, 2, 3])
>
> In [27]: myarray = le.transform([[c for c in myguide], [c for c in myguide], [c for c in myguide]])
>
> In [28]: myarray
> Out[28]:
> array([[0, 1, 2, 3],
>        [0, 1, 2, 3],
>        [0, 1, 2, 3]])
>
> In [29]: ohe.fit_transform(myarray)
> Out[29]:
> array([[ 1., 1., 1., 1.],
>        [ 1., 1., 1., 1.],
>        [ 1., 1., 1., 1.]]) <— ????
>
> So this is not at all what I expected. I read the documentation for OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) but did not find it clear how it worked (also, I found the example using integers confusing). Neither FeatureHasher nor DictVectorizer seems more appropriate for transforming strings into positional one-hot encoded arrays. Am I missing something, or is this operation not supported in sklearn?
>
> Thanks,
>
> --
> Lee Zamparo
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
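[Editor's note: for the per-character DNA encoding Lee describes, a minimal sketch using the later scikit-learn API, where the `categories` parameter (the successor to `n_values` in newer versions) lets you fix the alphabet up front instead of relying on column-wise inference:]

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

myguide = "ACGT"
# One character per row, so each row becomes one one-hot vector.
chars = np.array(list(myguide)).reshape(-1, 1)
# Fixing the alphabet explicitly means every string is encoded
# against the same four categories, even if some are absent.
enc = OneHotEncoder(categories=[["A", "C", "G", "T"]])
encoded = enc.fit_transform(chars).toarray()
print(encoded)
# Rows correspond to A, C, G, T:
# [[1., 0., 0., 0.],
#  [0., 1., 0., 0.],
#  [0., 0., 1., 0.],
#  [0., 0., 0., 1.]]
```

Because the encoder also accepts string input directly in these versions, the intermediate LabelEncoder step from the thread is no longer needed.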