Hi sklearners,

A lab-mate came to me with a problem about encoding DNA sequences using preprocessing.OneHotEncoder, and I find that it produces confusing results.

Suppose I have a DNA string:

    myguide = 'ACGT'

He'd like to use OneHotEncoder to transform DNA strings, character by character, into a one-hot encoded representation like this:

    [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]]

The use case seems to be solved in pandas by the dubiously named get_dummies method (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html). I thought that it would be trivial to do with OneHotEncoder, but it seems strangely difficult:

    In [23]: myarray = le.fit_transform([c for c in myguide])

    In [24]: myarray
    Out[24]: array([0, 1, 2, 3])

    In [27]: myarray = le.transform([[c for c in myguide],[c for c in myguide],[c for c in myguide]])

    In [28]: myarray
    Out[28]:
    array([[0, 1, 2, 3],
           [0, 1, 2, 3],
           [0, 1, 2, 3]])

    In [29]: ohe.fit_transform(myarray)
    Out[29]:
    array([[ 1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.]])    <-- ????

So this is not at all what I expected. I read the documentation for OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder), but did not find it clear how it works (I also found the example using integers confusing). Neither FeatureHasher nor DictVectorizer seems more appropriate for transforming strings into positional one-hot encoded arrays.

Am I missing something, or is this operation not supported in sklearn?

Thanks,

--
Lee Zamparo
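P.S. To make the target output concrete, here is a minimal plain-Python sketch of the encoding I am after. It does not use sklearn at all; the names ALPHABET and one_hot_dna are made up for illustration, and it assumes the sequence contains only the characters A, C, G and T, with columns in that order.

```python
ALPHABET = "ACGT"  # assumed alphabet; column order follows this string


def one_hot_dna(seq, alphabet=ALPHABET):
    """Return one row per character, with a single 1 in that character's column."""
    # Map each base to its column index, e.g. {'A': 0, 'C': 1, 'G': 2, 'T': 3}.
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq:
        row = [0] * len(alphabet)
        row[index[base]] = 1  # KeyError if the character is not in the alphabet
        rows.append(row)
    return rows


myguide = "ACGT"
encoded = one_hot_dna(myguide)
print(encoded)
```

Each character becomes its own row, which is the positional behaviour I was hoping to get from OneHotEncoder.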