Great.  Thank you for all your help.

Cheers,
Sarah

On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.noth...@gmail.com>
wrote:

> If you specify n_values=[list_of_vals_for_column1,
> list_of_vals_for_column2], you should be able to engineer it to how you
> want.
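[Editor's note: on current scikit-learn releases the old `n_values` parameter no longer exists; a roughly equivalent sketch uses the `categories` parameter, which takes the explicit list of values per column. Applied to the matrix from earlier in the thread:]

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])

# In scikit-learn >= 0.20 the per-column value lists go in `categories`
# (the deprecated `n_values` parameter was later removed entirely).
enc = OneHotEncoder(categories=[list(range(8)), list(range(3)), list(range(4))])
test = enc.fit_transform(X).toarray()  # output is sparse by default

print(test.shape)  # (4, 15): 8 + 3 + 4 columns, including the all-zero ones
```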
>
> On 5 February 2018 at 16:31, Sarah Wait Zaranek <sarah.zara...@gmail.com>
> wrote:
>
>> If I use the n+1 approach, then I get the correct matrix, except with
>> extra columns of zeros:
>>
>> >>> test
>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>>
>>
>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <
>> sarah.zara...@gmail.com> wrote:
>>
>>> Hi Joel -
>>>
>>> Conceptually, that makes sense.  But when I assign n_values, I can't
>>> make the result match what I get when I don't specify them.  See below.
>>> I used the number of unique levels per column.
>>>
>>> >>> enc = OneHotEncoder(sparse=False)
>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0,
>>> 2]])
>>> >>> test
>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4])
>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0,
>>> 2]])
>>> >>> test
>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>
>>> Cheers,
>>> Sarah
>>>
>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.noth...@gmail.com>
>>> wrote:
>>>
>>>> If each input column is encoded as a value from 0 to (number of possible
>>>> values for that column - 1), then n_values for that column should be the
>>>> highest value + 1, which is also the number of levels in that column.
>>>> Does that make sense?
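[Editor's note: a small numpy sketch of the max + 1 rule, using the matrix from earlier in the thread. It also shows why counting distinct values went wrong: column 0 contains only {0, 1, 7}, so the distinct count (3) differs from the required slot count (8).]

```python
import numpy as np

X = np.array([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])

# For 0-based integer codes, n_values per column is max + 1, not the
# number of distinct values actually present: column 0 holds only
# {0, 1, 7}, yet still needs 7 + 1 = 8 slots.
n_values = X.max(axis=0) + 1
print(n_values)        # [8 3 4]
print(n_values.sum())  # 15 output columns in total
```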
>>>>
>>>> Actually, I've realised there's a somewhat slow and unnecessary bit of
>>>> code in the one-hot encoder: where the COO matrix is converted to CSR. I
>>>> suspect this was done because most of our ML algorithms perform better on
>>>> CSR, or else to maintain backwards compatibility with an earlier
>>>> implementation.
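[Editor's note: the COO-to-CSR step can be sketched with scipy directly; the indices below are read off the 15-column matrix quoted above, not taken from the encoder's internals.]

```python
import numpy as np
from scipy import sparse

# A one-hot matrix is naturally assembled in COO form: one
# (row, column, 1.0) triple per input cell.  Converting to CSR
# afterwards is an extra pass over the data.
rows = np.repeat(np.arange(4), 3)  # 4 samples x 3 features
cols = np.array([7, 8, 14, 1, 10, 11, 0, 10, 12, 1, 8, 13])
data = np.ones(12)

coo = sparse.coo_matrix((data, (rows, cols)), shape=(4, 15))
csr = coo.tocsr()  # same values, CSR layout for row-wise algorithms
```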
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn@python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>
>>>
>>
>>
>>
>
>
>
