Re: [scikit-learn] behaviour of OneHotEncoder somewhat confusing

Andreas Mueller Wed, 21 Sep 2016 22:10:54 -0700

Yeah, the input format is a bit odd. Usually it should be n_samples x n_features, so something like

[['A'], ['C'], ['T'], ['G']]

Though this is currently also hard to do :(
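
For readers on a newer release, here is a minimal sketch of the (n_samples, n_features) layout described above. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string input and a `categories` parameter; that postdates this thread, so it is not what was available to the posters. The fixed A, C, G, T ordering is an assumption made for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One character per row: shape (n_samples, n_features) = (4, 1).
X = np.array([['A'], ['C'], ['T'], ['G']])

# Assumption: the alphabet is fixed up front so the output columns are
# always ordered A, C, G, T, even when a sequence lacks one of the bases.
enc = OneHotEncoder(categories=[['A', 'C', 'G', 'T']])

# fit_transform returns a sparse matrix by default; densify for display.
encoded = enc.fit_transform(X).toarray()
print(encoded)
```

Each row of `encoded` holds a single 1 in the column of its base, so the rows for 'T' and 'G' are swapped relative to an identity matrix.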


On 09/20/2016 05:50 AM, Lee Zamparo wrote:

Hi Joel,

Yea, seems that the one-hot encoding of the transpose solves the issue. As you say, and as I mentioned to Sebastian, it seems a bit off-usage for OneHotEncoder.


Thanks for the solution all the same though.

--
Lee Zamparo

On September 19, 2016 at 7:48:15 PM, Joel Nothman (joel.noth...@gmail.com) wrote:

OneHotEncoder has issues, but I think all you want here is

ohe.fit_transform(np.transpose(le.fit_transform([c for c in myguide])))

Still, this seems like it is far from the intended use of OneHotEncoder (which should not really be stacked with LabelEncoder), so it's not surprising it's tricky.
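
One detail worth checking when trying the transpose approach: np.transpose leaves a one-dimensional array unchanged, so a 1-D array of labels needs reshape(-1, 1) to become a column. The sketch below illustrates this in plain NumPy and builds the encoding by indexing an identity matrix. It stands in for the LabelEncoder/OneHotEncoder pair rather than reproducing it, and the `alphabet` variable is an assumption introduced for illustration.

```python
import numpy as np

myguide = "ACGT"
alphabet = "ACGT"  # assumed alphabet; it fixes the column order of the encoding

# Integer code per character, standing in for LabelEncoder in the thread.
codes = np.array([alphabet.index(c) for c in myguide])
print(codes.shape)                # (4,)

# np.transpose does not change a 1-D array, so it cannot create a column.
print(np.transpose(codes).shape)  # (4,)

# reshape(-1, 1) gives the (n_samples, 1) layout: one character per row.
column = codes.reshape(-1, 1)
print(column.shape)               # (4, 1)

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(alphabet), dtype=int)[codes]
print(one_hot)
```

For 'ACGT' with this alphabet, `one_hot` is the 4 x 4 identity matrix, which is the representation asked for in the original question.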

On 20 September 2016 at 08:07, Sebastian Raschka <se.rasc...@gmail.com> wrote:


    Hi, Lee,

    maybe set `n_values=4`, this seems to do the job. I think the
    problem you encountered is due to the fact that the one-hot
    encoder infers the number of values for each feature (column)
    from the dataset. In your example, each column had only 1 unique
    value:

    > array([[0, 1, 2, 3],
    >        [0, 1, 2, 3],
    >        [0, 1, 2, 3]])

    If you had an array like

    > array([[0],
    >        [1],
    >        [2],
    >        [3]])

    it should work though. Alternatively, set n_values to 4:


    > >>> from sklearn.preprocessing import OneHotEncoder
    > >>> import numpy as np
    >
    > >>> enc = OneHotEncoder(n_values=4)
    > >>> X = np.array([[0, 1, 2, 3]])
    > >>> enc.fit_transform(X).toarray()


    array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
             0.,  0.,  1.]])

    and

    > X2 = np.array([[0, 1, 2, 3],
    >                [0, 1, 2, 3],
    >                [0, 1, 2, 3]])
    >
    > enc.transform(X2).toarray()



    array([[ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
             0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
             0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
             0.,  0.,  1.]])
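
The explanation above, that the set of values is inferred separately for each column, can be illustrated with a simplified NumPy sketch. This is not scikit-learn's implementation; `one_hot_inferred` is a hypothetical helper written only to show why a row-wise layout yields all ones while a column layout yields the expected encoding.

```python
import numpy as np

def one_hot_inferred(X):
    """One-hot encode each column using only the values seen in that column."""
    blocks = []
    for col in X.T:
        values = np.unique(col)  # categories inferred from the data
        # Compare every entry with every inferred value: (n_samples, n_values).
        blocks.append((col[:, None] == values[None, :]).astype(int))
    return np.hstack(blocks)

row_wise = np.array([[0, 1, 2, 3]] * 3)    # 3 samples, 4 features
col_wise = np.array([[0], [1], [2], [3]])  # 4 samples, 1 feature

print(one_hot_inferred(row_wise))  # every column holds one value: all ones
print(one_hot_inferred(col_wise))  # one column with four values: identity
```

In the row-wise case each column contributes a single output column that is always active, which matches the array of ones reported in the original question.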


    Best,
    Sebastian


    > On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamp...@gmail.com> wrote:
    >
    > Hi sklearners,
    >
    > A lab-mate came to me with a problem about encoding DNA
    > sequences using preprocessing.OneHotEncoder, and I find it to
    > produce confusing results.
    >
    > Suppose I have a DNA string:  myguide = 'ACGT'
    >
    > He'd like to use OneHotEncoder to transform DNA strings, character
    > by character, into a one-hot encoded representation like this:
    > [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]].  The use-case seems
    > to be solved in pandas using the dubiously named get_dummies
    > method
    > (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
    > I thought that it would be trivial to do with OneHotEncoder, but
    > it seems strangely difficult:
    >
    > In [23]: myarray = le.fit_transform([c for c in myguide])
    >
    > In [24]: myarray
    > Out[24]: array([0, 1, 2, 3])
    >
    > In [27]: myarray = le.transform([[c for c in myguide],[c for c in myguide],[c for c in myguide]])
    >
    > In [28]: myarray
    > Out[28]:
    > array([[0, 1, 2, 3],
    >        [0, 1, 2, 3],
    >        [0, 1, 2, 3]])
    >
    > In [29]: ohe.fit_transform(myarray)
    > Out[29]:
    > array([[ 1.,  1.,  1.,  1.],
    >        [ 1.,  1.,  1.,  1.],
    >        [ 1.,  1.,  1.,  1.]])    <-- ????
    >
    > So this is not at all what I expected.  I read the
    > documentation for OneHotEncoder
    > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
    > but did not find it clear how it worked (also I found the example
    > using integers confusing).  Neither FeatureHasher nor
    > DictVectorizer seems to be more appropriate for transforming
    > strings into positional one-hot encoded arrays.  Am I missing
    > something, or is this operation not supported in sklearn?
    >
    > Thanks,
    >
    > --
    > Lee Zamparo
    > _______________________________________________
    > scikit-learn mailing list
    > scikit-learn@python.org <mailto:scikit-learn@python.org>
    > https://mail.python.org/mailman/listinfo/scikit-learn
    <https://mail.python.org/mailman/listinfo/scikit-learn>
