Re: [scikit-learn] One-hot encoding

Sarah Wait Zaranek Sun, 04 Feb 2018 20:34:03 -0800

Hi Joel -

20 million categorical variables.  They come from segmenting the genome into
20 million parts.  Genomes are big :)  For n_values, I am a bit confused.
Is the input the same as the output for n_values?  Originally, I thought it
was just the number of levels per column, but it seems to be more like the
highest integer value among the levels.
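
A minimal sketch of that reading, assuming the levels are integer codes
starting at 0 and that, as the OneHotEncoder documentation of the time
describes, each feature value must lie in range(n_values).  The helper
one_hot_columns and the toy data are hypothetical, and this is not
scikit-learn's implementation; it only illustrates how n_values would
determine the output column layout.

```python
from itertools import accumulate


def one_hot_columns(rows, n_values):
    """Return (active output column indices per row, total output width)."""
    # Input column j owns a block of n_values[j] output columns,
    # starting at offsets[j].
    offsets = [0] + list(accumulate(n_values))
    encoded = []
    for row in rows:
        active = []
        for j, value in enumerate(row):
            # Each level must be an integer in range(n_values[j]).
            if not 0 <= value < n_values[j]:
                raise ValueError(
                    f"value {value} in column {j} is outside range({n_values[j]})"
                )
            active.append(offsets[j] + value)
        encoded.append(active)
    return encoded, offsets[-1]


# Hypothetical toy data: column 0 uses levels {0, 2}, column 1 uses {0, 1}.
tiles = [[0, 1], [2, 0], [2, 1]]

# Column 0 has only two distinct levels, but its highest level is 2,
# so it needs n_values = 3 (highest level + 1), not 2.
columns, width = one_hot_columns(tiles, n_values=[3, 2])
print(columns, width)
```

Under this reading, passing n_values=[2, 2] for the same data would fail,
because level 2 in the first column falls outside range(2).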


Cheers,
Sarah

On Sun, Feb 4, 2018 at 11:27 PM, Joel Nothman wrote:

> 20 million categories, or 20 million categorical variables?
>
> OneHotEncoder is pretty efficient if you specify n_values.
>
> On 5 February 2018 at 15:10, Sarah Wait Zaranek wrote:
>
>> Hello -
>>
>> I was just wondering if there was a way to improve performance on the
>> one-hot encoder.  Or are there any plans to do so in the future?  I am
>> working with a matrix that will ultimately have 20 million categorical
>> variables, and my bottleneck is the one-hot encoder.
>>
>> Let me know if this isn't the place to inquire.  My code is very simple
>> when using the encoder, but I cut and pasted it here for completeness.
>>
>>     from sklearn.preprocessing import OneHotEncoder
>>     enc = OneHotEncoder(sparse=True)
>>     Xtrain = enc.fit_transform(tiledata)
>>
>>
>> Thanks,
>> Sarah
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> [email protected]
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
