Hi sklearners,

A lab-mate came to me with a problem about encoding DNA sequences using
preprocessing.OneHotEncoder, and I find that it produces confusing results.

Suppose I have a DNA string:  myguide = 'ACGT'

He’d like to use OneHotEncoder to transform DNA strings, character by
character, into a one-hot encoded representation like this: [[1,0,0,0],
[0,1,0,0], [0,0,1,0], [0,0,0,1]].  The use case seems to be solved in
pandas by the dubiously named get_dummies method.  I thought that it would
be trivial to do with OneHotEncoder, but it seems strangely difficult
(below, le is a LabelEncoder instance and ohe is a OneHotEncoder instance):

In [23]: myarray = le.fit_transform([c for c in myguide])

In [24]: myarray
Out[24]: array([0, 1, 2, 3])

In [27]: myarray = le.transform([[c for c in myguide],[c for c in
myguide],[c for c in myguide]])

In [28]: myarray
array([[0, 1, 2, 3],
       [0, 1, 2, 3],
       [0, 1, 2, 3]])

In [29]: ohe.fit_transform(myarray)
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])    <— ????

So this is not at all what I expected.  I read the documentation for
OneHotEncoder but did not find it clear how it works (I also found the
example using integers confusing).  Neither FeatureHasher nor DictVectorizer
seems more appropriate for transforming strings into positional one-hot
encoded arrays.  Am I missing something, or is this operation not supported in
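
For reference, here is a minimal sketch of the encoding described above. It
assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string input
and a categories argument; the session above predates that. The idea is to
treat each character as one sample with a single feature, so the input is a
column of shape (4, 1) rather than rows of shape (3, 4). In the session above,
each column of myarray holds only one distinct value, so each feature gets
exactly one indicator column, and that column is always 1. The variable names
chars and encoded are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

myguide = "ACGT"

# One row per character and a single column: shape (len(myguide), 1).
chars = np.array(list(myguide)).reshape(-1, 1)

# Fix the alphabet explicitly so the column order is always A, C, G, T,
# even for sequences that do not contain every base.
ohe = OneHotEncoder(categories=[["A", "C", "G", "T"]])

# fit_transform returns a sparse matrix by default; densify it for display.
encoded = ohe.fit_transform(chars).toarray()
print(encoded)
```

With the alphabet fixed this way, a different sequence can be encoded by
calling ohe.transform on its characters reshaped into the same single-column
layout.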


Lee Zamparo
scikit-learn mailing list
