OneHotEncoder will not magically reduce the size of your input. Even when the results are stored in scipy.sparse matrices, the encoded output will necessarily take more memory than the input data. The sparse representation will be much less expensive than the dense representation, but it won't be less expensive than the input.
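To make that concrete, here is a minimal sketch (plain NumPy/SciPy, not OneHotEncoder itself) that one-hot encodes a single integer column into a CSR matrix and compares byte counts; the shapes and dtypes are illustrative assumptions:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_rows, n_levels = 100_000, 50
col = rng.integers(0, n_levels, size=n_rows)  # one categorical column, int64

# One-hot encode: row i gets a single 1.0 in column col[i].
onehot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), col)),
    shape=(n_rows, n_levels),
)

input_bytes = col.nbytes
# CSR storage = data values + column indices + row pointers.
sparse_bytes = onehot.data.nbytes + onehot.indices.nbytes + onehot.indptr.nbytes
dense_bytes = n_rows * n_levels * 8  # equivalent float64 dense array

# The ordering is always: input < sparse output << dense output.
print(input_bytes, sparse_bytes, dense_bytes)
```

The CSR output stores one value plus one index per input cell, so it is a small constant factor larger than the input, while the dense version grows with the number of levels.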
On 6 February 2018 at 13:24, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:

> Hi Joel -
>
> I am also seeing a huge overhead in memory when calling the
> one-hot encoder. I have hacked around it by splitting my matrix into
> 4-5 smaller matrices (by columns) and then concatenating the results. But
> I am seeing upwards of 100 Gigs of overhead. Should I file a bug report? Or
> is this to be expected?
>
> Cheers,
> Sarah
>
> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>
>> Great. Thank you for all your help.
>>
>> Cheers,
>> Sarah
>>
>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>
>>> If you specify n_values=[list_of_vals_for_column1,
>>> list_of_vals_for_column2], you should be able to engineer it to how you
>>> want.
>>>
>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>
>>>> If I use the n+1 approach, then I get the correct matrix, except with
>>>> the columns of zeros:
>>>>
>>>> >>> test
>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>>>>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>>>>
>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>>
>>>>> Hi Joel -
>>>>>
>>>>> Conceptually, that makes sense. But when I assign n_values, I can't
>>>>> make it match the result when you don't specify them. See below. I used
>>>>> the number of unique levels per column.
>>>>>
>>>>> >>> enc = OneHotEncoder(sparse=False)
>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>> >>> test
>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>>>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>>>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>> >>> enc = OneHotEncoder(sparse=False, n_values=[3, 2, 4])
>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>> >>> test
>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>>>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>>>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>
>>>>> Cheers,
>>>>> Sarah
>>>>>
>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>
>>>>>> If each input column is encoded as a value from 0 to (number of
>>>>>> possible values for that column - 1), then n_values for that column should
>>>>>> be the highest value + 1, which is also the number of levels per column.
>>>>>> Does that make sense?
>>>>>>
>>>>>> Actually, I've realised there's a somewhat slow and unnecessary bit
>>>>>> of code in the one-hot encoder: where the COO matrix is converted to CSR. I
>>>>>> suspect this was done because most of our ML algorithms perform better on
>>>>>> CSR, or else to maintain backwards compatibility with an earlier
>>>>>> implementation.
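Joel's "highest value + 1" rule can be checked with a small NumPy sketch of the offset indexing involved (an illustration, not scikit-learn's actual implementation). With the data above, column 0 contains a 7, so it needs 8 output slots even though only 3 distinct values occur in it:

```python
import numpy as np

X = np.array([[7, 0, 3],
              [1, 2, 0],
              [0, 2, 1],
              [1, 0, 2]])

# n_values per column must be (highest value + 1), not the number of
# distinct values observed in that column.
n_values = X.max(axis=0) + 1                               # [8, 3, 4]
offsets = np.concatenate(([0], np.cumsum(n_values)[:-1]))  # start of each block

out = np.zeros((X.shape[0], int(n_values.sum())))
rows = np.arange(X.shape[0])
for j in range(X.shape[1]):
    # Value v in input column j maps to output column offsets[j] + v.
    out[rows, offsets[j] + X[:, j]] = 1.0

print(out.shape)  # -> (4, 15): 8 + 3 + 4 output columns
```

Unused levels (e.g. values 2-6 in column 0) show up as all-zero columns, which is exactly the "columns of zeros" in Sarah's 15-column matrix above.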
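Sarah's workaround of encoding column blocks separately and concatenating can be sketched as follows (the helper names are hypothetical, and the toy encoder stands in for OneHotEncoder). Because each input column is encoded independently, the chunked result is identical to a single pass, but the transient COO-style buffers only ever cover one block of columns at a time:

```python
import numpy as np
from scipy import sparse

def onehot_sparse(X):
    """One-hot encode an integer matrix into CSR, one block of output
    columns per input column, using n_values = highest value + 1."""
    n_values = X.max(axis=0) + 1
    offsets = np.concatenate(([0], np.cumsum(n_values)[:-1]))
    rows = np.repeat(np.arange(X.shape[0]), X.shape[1])
    cols = (X + offsets).ravel()  # shift each column into its own block
    return sparse.csr_matrix(
        (np.ones(X.size), (rows, cols)),
        shape=(X.shape[0], int(n_values.sum())),
    )

def onehot_chunked(X, n_chunks=4):
    """Encode groups of input columns separately and hstack the pieces,
    so peak transient memory covers only one group at a time."""
    parts = [onehot_sparse(X[:, cols])
             for cols in np.array_split(np.arange(X.shape[1]), n_chunks)]
    return sparse.hstack(parts, format="csr")

X = np.array([[7, 0, 3, 1], [1, 2, 0, 0], [0, 2, 1, 2], [1, 0, 2, 1]])
# Chunked and single-pass encodings agree exactly.
print((onehot_chunked(X).toarray() == onehot_sparse(X).toarray()).all())  # True
```

This only trades peak memory for a little bookkeeping; the final concatenated matrix is the same size either way.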
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn