20 million categories, or 20 million categorical variables?
OneHotEncoder is pretty efficient if you specify n_values.
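To illustrate why knowing the number of values per feature up front helps, here is a minimal sketch in plain Python. It is not scikit-learn's implementation; the function name `one_hot_csr` and its return layout are made up for this example. It assumes the input is already integer-coded, with each column's values in `range(n_values[j])`. With the counts given, each output column index is just a fixed offset plus the value, so no extra pass over the data is needed to discover the value ranges.

```python
from itertools import accumulate


def one_hot_csr(rows, n_values):
    """One-hot encode integer-coded rows into CSR-style arrays.

    Hypothetical helper for illustration only.
    rows:     list of equal-length lists of non-negative ints
    n_values: number of distinct values per column, known in advance
    """
    # First output column of each input feature: [0, n0, n0 + n1, ...]
    offsets = [0] + list(accumulate(n_values))
    n_features = len(n_values)

    indices = []
    for row in rows:
        for j, value in enumerate(row):
            if not 0 <= value < n_values[j]:
                raise ValueError(f"value {value} out of range for column {j}")
            # Output column of the single 1 for this (row, feature) pair
            indices.append(offsets[j] + value)

    # Every row holds exactly n_features ones, so row pointers are regular
    indptr = [i * n_features for i in range(len(rows) + 1)]
    data = [1.0] * len(indices)
    shape = (len(rows), offsets[-1])
    return data, indices, indptr, shape


data, indices, indptr, shape = one_hot_csr([[0, 2], [1, 0]], n_values=[2, 3])
```

In the scikit-learn API of this period the corresponding call would be along the lines of OneHotEncoder(n_values=..., sparse=True). Releases from 0.20 onward replace n_values with a categories parameter, so check the documentation for your installed version.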
On 5 February 2018 at 15:10, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
> Hello -
> I was just wondering if there was a way to improve performance on the
> one-hot encoder. Or, are there any plans to do so in the future? I am
> working with a matrix that will ultimately have 20 million categorical
> variables, and my bottleneck is the one-hot encoder.
> Let me know if this isn't the place to inquire. My code is very simple
> when using the encoder, but I cut and pasted it here for completeness.
> from sklearn.preprocessing import OneHotEncoder
> enc = OneHotEncoder(sparse=True)
> Xtrain = enc.fit_transform(tiledata)
> scikit-learn mailing list