Thanks, this makes sense. I will try using the CategoricalEncoder to see the difference. It wouldn't be such a big deal if my input matrix weren't so large. Thanks again for all your help.

Cheers,
Sarah
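For reference, a minimal sketch of what that swap might look like, assuming the development-version API of early 2018 (as noted below, CategoricalEncoder may yet be merged into OneHotEncoder; `X` here stands for the integer input matrix):

from sklearn.preprocessing import CategoricalEncoder  # dev version only

# Unlike OneHotEncoder, the categories per column are learned from the
# data itself rather than assumed to be the integers 0..max, so no
# all-zero columns appear for levels that never occur.
enc = CategoricalEncoder(encoding='onehot')  # returns a sparse matrix
X_onehot = enc.fit_transform(X)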
On Mon, Feb 5, 2018 at 10:33 PM, Joel Nothman <joel.noth...@gmail.com> wrote:

> Yes, the output CSR representation requires:
> 1 (dtype) value per entry
> 1 int32 per entry
> 1 int32 per row
>
> The intermediate COO representation requires:
> 1 (dtype) value per entry
> 2 int32 per entry
>
> So as long as the transformation from COO to CSR is done over the whole
> data, it will occupy roughly 5x the input size, which is exactly what you
> are experiencing.
>
> The CategoricalEncoder currently available in the development version of
> scikit-learn does not have this problem, but might be slower due to
> handling non-integer categories. It will also possibly disappear and be
> merged into OneHotEncoder soon (see PR #10523).
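For concreteness, that accounting can be checked directly with scipy (the matrix size here is made up; only the ratio matters):

import scipy.sparse as sp

# Any COO matrix will do; OneHotEncoder builds one internally.
coo = sp.random(100000, 1000, density=0.01, format='coo', random_state=0)

# COO holds one value plus two index entries per stored element.
coo_bytes = coo.data.nbytes + coo.row.nbytes + coo.col.nbytes

# CSR holds one value plus one index entry per stored element,
# plus one row pointer per row (and one extra).
csr = coo.tocsr()
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(coo_bytes, csr_bytes)

While tocsr() runs over the whole matrix, both copies are alive at once, which is where the transient peak described above comes from.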
> On 6 February 2018 at 13:53, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>
>> Yes, of course. What I meant is that I start out with 19 GB (the initial
>> matrix size) or so; it balloons to 100 GB *within the encoder function*
>> and returns 28 GB (the sparse one-hot matrix size). These numbers aren't
>> exact, but you can see my point.
>>
>> Cheers,
>> Sarah
>>
>> On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>
>>> OneHotEncoder will not magically reduce the size of your input. As long
>>> as we are storing the results in scipy.sparse matrices, it will
>>> necessarily take more memory than the input data. The sparse
>>> representation will be less expensive than the dense representation,
>>> but it won't be less expensive than the input.
>>>
>>> On 6 February 2018 at 13:24, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>
>>>> Hi Joel -
>>>>
>>>> I am also seeing a huge overhead in memory when calling the
>>>> OneHotEncoder. I have worked around it by splitting my matrix into
>>>> 4-5 smaller matrices (by columns), encoding each, and then
>>>> concatenating the results (a sketch of this workaround appears at the
>>>> end of the thread). But I am still seeing upwards of 100 GB of
>>>> overhead. Should I file a bug report, or is this to be expected?
>>>>
>>>> Cheers,
>>>> Sarah
>>>>
>>>> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>>
>>>>> Great. Thank you for all your help.
>>>>>
>>>>> Cheers,
>>>>> Sarah
>>>>>
>>>>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>
>>>>>> If you specify n_values=[list_of_vals_for_column1,
>>>>>> list_of_vals_for_column2], you should be able to engineer it to
>>>>>> what you want.
>>>>>>
>>>>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>>>>
>>>>>>> If I use the n+1 approach, then I get the correct matrix, except
>>>>>>> with the extra columns of zeros:
>>>>>>>
>>>>>>> >>> test
>>>>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>>>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>>>>>>>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>>>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>>>>>>>
>>>>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Joel -
>>>>>>>>
>>>>>>>> Conceptually, that makes sense. But when I assign n_values, I
>>>>>>>> can't make it match the result when they aren't specified. See
>>>>>>>> below. I used the number of unique levels per column.
>>>>>>>>
>>>>>>>> >>> enc = OneHotEncoder(sparse=False)
>>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>>>> >>> test
>>>>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>>>>>>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>>>>>>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>>>> >>> enc = OneHotEncoder(sparse=False, n_values=[3, 2, 4])
>>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>>>>>> >>> test
>>>>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>>>>>>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>>>>>>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>>>>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Sarah
>>>>>>>>
>>>>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> If each input column is encoded as a value from 0 to (the number
>>>>>>>>> of possible values for that column - 1), then n_values for that
>>>>>>>>> column should be the highest value + 1, which is also the number
>>>>>>>>> of levels per column. Does that make sense?
>>>>>>>>>
>>>>>>>>> Actually, I've realised there's a somewhat slow and unnecessary
>>>>>>>>> bit of code in the one-hot encoder: where the COO matrix is
>>>>>>>>> converted to CSR. I suspect this was done because most of our ML
>>>>>>>>> algorithms perform better on CSR, or else to maintain backwards
>>>>>>>>> compatibility with an earlier implementation.
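That rule ties the two transcripts above together: the per-column maxima of the sample data are 7, 2 and 3, so "highest value + 1" gives n_values=[8, 3, 4]. A minimal sketch against the same API:

from sklearn.preprocessing import OneHotEncoder

X = [[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]

# One output column per possible level: 8 + 3 + 4 = 15 columns.
enc = OneHotEncoder(sparse=False, n_values=[8, 3, 4])
test = enc.fit_transform(X)

This reproduces the 15-column matrix shown earlier. With n_values='auto', the encoder additionally drops the all-zero columns for levels that never occur (such as 2 through 6 in the first column), leaving the 9-column matrix.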
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
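Finally, a rough sketch of the column-splitting workaround mentioned mid-thread (the block count and function name are illustrative). Because OneHotEncoder treats columns independently, encoding blocks of columns and stacking the results gives the same matrix as encoding everything at once, while the transient COO-to-CSR overhead is only paid for one block at a time:

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

def onehot_in_column_blocks(X, n_blocks=5):
    # Split the columns into n_blocks groups, encode each group on its
    # own, and stitch the sparse results back together; peak memory is
    # bounded by the largest block rather than by the whole matrix.
    blocks = np.array_split(np.asarray(X), n_blocks, axis=1)
    encoded = [OneHotEncoder().fit_transform(block) for block in blocks]
    return sp.hstack(encoded, format='csr')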