Yeah, the input format is a bit odd; usually it should be n_samples x n_features, so something like
[['A'], ['C'], ['T'], ['G']]

Though this is currently also hard to do :(
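For reference, that column layout is easy to build with plain NumPy, without going through scikit-learn at all. The `one_hot_dna` helper below is mine, not a library API; it's just a sketch of the broadcasting trick:

```python
import numpy as np

# Fixed column order for the one-hot encoding.
ALPHABET = np.array(list("ACGT"))

def one_hot_dna(seq):
    """Return a (len(seq), 4) one-hot array for a DNA string."""
    column = np.array(list(seq)).reshape(-1, 1)   # n_samples x 1 column
    return (column == ALPHABET).astype(float)     # broadcast comparison

encoded = one_hot_dna("ACGT")
print(encoded)  # 4 x 4 identity: each row has a single 1 in the matching column
```

Comparing the n_samples x 1 column against the 1-D alphabet broadcasts to an n_samples x 4 boolean array, which is exactly the one-hot layout.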

On 09/20/2016 05:50 AM, Lee Zamparo wrote:
Hi Joel,

Yeah, it seems that one-hot encoding the transpose solves the issue. As you say, and as I mentioned to Sebastian, it seems a bit outside the intended usage of OneHotEncoder.

Thanks for the solution all the same though.

--
Lee Zamparo

On September 19, 2016 at 7:48:15 PM, Joel Nothman (joel.noth...@gmail.com) wrote:

OneHotEncoder has issues, but I think all you want here is

ohe.fit_transform(le.fit_transform([c for c in myguide]).reshape(-1, 1))

Still, this seems like it is far from the intended use of OneHotEncoder (which should not really be stacked with LabelEncoder), so it's not surprising it's tricky.
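Spelled out end to end, the LabelEncoder-then-OneHotEncoder route looks like the sketch below. Note that `np.transpose` is a no-op on the 1-D array `LabelEncoder` returns, so an explicit reshape to a column is the reliable way to get the 2-D input `OneHotEncoder` expects:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

myguide = "ACGT"

# Step 1: map characters to integer codes (A=0, C=1, G=2, T=3).
le = LabelEncoder()
codes = le.fit_transform(list(myguide))       # 1-D: array([0, 1, 2, 3])

# Step 2: OneHotEncoder needs a 2-D n_samples x n_features array,
# so reshape the codes into an n_samples x 1 column first.
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()
print(onehot)  # 4 x 4, one 1 per row
```

`fit_transform` returns a sparse matrix by default, hence the `.toarray()` at the end.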

On 20 September 2016 at 08:07, Sebastian Raschka <se.rasc...@gmail.com> wrote:

    Hi, Lee,

    maybe setting `n_values=4` does the job. I think the problem you
    encountered is that the one-hot encoder infers the number of
    values for each feature (column) from the dataset, and in your
    example each column has only one unique value:

    > array([[0, 1, 2, 3],
    >        [0, 1, 2, 3],
    >        [0, 1, 2, 3]])

    If you had an array like

    > array([[0],
    >        [1],
    >        [2],
    >        [3]])

    it should work though. Alternatively, set n_values to 4:


    > >>> from sklearn.preprocessing import OneHotEncoder
    > >>> import numpy as np
    >
    > >>> enc = OneHotEncoder(n_values=4)
    > >>> X = np.array([[0, 1, 2, 3]])
    > >>> enc.fit_transform(X).toarray()


    array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
             0.,  0.,  1.]])

    and

    > X2 = np.array([[0, 1, 2, 3],
    >                [0, 1, 2, 3],
    >                [0, 1, 2, 3]])
    >
    > enc.transform(X2).toarray()



    array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
             0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
             0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
             0.,  0.,  1.]])


    Best,
    Sebastian


    > On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamp...@gmail.com> wrote:
    >
    > Hi sklearners,
    >
    > A lab-mate came to me with a problem about encoding DNA
    > sequences using preprocessing.OneHotEncoder, and I find that it
    > produces confusing results.
    >
    > Suppose I have a DNA string:  myguide = ‘ACGT’
    >
    > He’d like to use OneHotEncoder to transform DNA strings, character
    > by character, into a one-hot encoded representation like this:
    > [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]].  The use-case seems
    > to be solved in pandas using the dubiously named get_dummies method
    > (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
    > I thought that it would be trivial to do with OneHotEncoder, but
    > it seems strangely difficult:
    >
    > In [23]: myarray = le.fit_transform([c for c in myguide])
    >
    > In [24]: myarray
    > Out[24]: array([0, 1, 2, 3])
    >
    > In [27]: myarray = le.transform([[c for c in myguide],[c for c
    in myguide],[c for c in myguide]])
    >
    > In [28]: myarray
    > Out[28]:
    > array([[0, 1, 2, 3],
    >        [0, 1, 2, 3],
    >        [0, 1, 2, 3]])
    >
    > In [29]: ohe.fit_transform(myarray)
    > Out[29]:
    > array([[ 1.,  1.,  1.,  1.],
    >        [ 1.,  1.,  1., 1.],
    >        [ 1.,  1.,  1., 1.]])    <— ????
    >
    > So this is not at all what I expected.  I read the documentation
    > for OneHotEncoder
    > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
    > but did not find it clear how it works (also, I found the example
    > using integers confusing).  Neither FeatureHasher nor
    > DictVectorizer seems more appropriate for transforming strings
    > into positional one-hot encoded arrays.  Am I missing something,
    > or is this operation not supported in sklearn?
    >
    > Thanks,
    >
    > --
    > Lee Zamparo
    > _______________________________________________
    > scikit-learn mailing list
    > scikit-learn@python.org
    > https://mail.python.org/mailman/listinfo/scikit-learn
