Hi Joel -
20 million categorical variables. They come from segmenting the genome into
20 million parts. Genomes are big :) As for n_values, I am a bit confused.
Is the input the same as the output for n_values? Originally, I thought it
was just the number of levels per column, but it seems to be closer to
the highest integer value among the levels in each column.
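
To make the two readings of n_values concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation. It assumes what the OneHotEncoder documentation of that era states, namely that every value in column i must lie in range(n_values[i]), so n_values[i] is the highest integer code plus one. The helper name one_hot_columns and the toy data are hypothetical.

```python
from itertools import accumulate


def one_hot_columns(rows, n_values):
    """Return the active one-hot column indices for each integer-coded row.

    Assumption: every value in column i lies in range(n_values[i]), so
    n_values[i] is (highest code + 1), not the count of distinct levels.
    """
    # Starting output column for each input feature.
    offsets = [0] + list(accumulate(n_values))[:-1]
    encoded = []
    for row in rows:
        active = []
        for i, value in enumerate(row):
            if not 0 <= value < n_values[i]:
                raise ValueError(f"value {value} out of range for column {i}")
            active.append(offsets[i] + value)
        encoded.append(active)
    # Total output width is the sum of the per-column ranges.
    return encoded, sum(n_values)


# Hypothetical toy data: column 0 uses codes {0, 5}, column 1 uses {0, 1, 2}.
rows = [[0, 1], [5, 2], [0, 0]]
n_values = [6, 3]  # highest code + 1 per column
encoded, width = one_hot_columns(rows, n_values)
```

Under this reading the output is 9 columns wide (6 + 3), even though only 5 distinct levels (2 + 3) actually appear in the data, which is the gap between "number of levels" and "highest value" described above.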
On Sun, Feb 4, 2018 at 11:27 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
> 20 million categories, or 20 million categorical variables?
> OneHotEncoder is pretty efficient if you specify n_values.
> On 5 February 2018 at 15:10, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
>> Hello -
>> I was just wondering if there was a way to improve performance on the
>> one-hot encoder. Or, are there any plans to do so in the future? I am
>> working with a matrix that will ultimately have 20 million categorical
>> variables, and my bottleneck is the one-hot encoder.
>> Let me know if this isn't the place to inquire. My code is very simple
>> when using the encoder, but I cut and pasted it here for completeness.
>> from sklearn.preprocessing import OneHotEncoder
>> enc = OneHotEncoder(sparse=True)
>> Xtrain = enc.fit_transform(tiledata)
>> scikit-learn mailing list