Hi Joel - 20 million categorical variables. They come from segmenting the genome into 20 million parts. Genomes are big :) As for n_values, I am a bit confused. Is the input the same as the output for n_values? Originally, I thought it was just the number of levels per column, but it seems to be more like the highest integer value among the levels.
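To make the question concrete, here is a small sketch of how I currently read n_values. It is plain Python, not scikit-learn's actual code, and the helper name one_hot_columns and the toy data are made up for illustration. It assumes each column's integer levels must lie in range(n_values[i]), so n_values[i] would be the highest level plus one rather than the count of distinct levels.

```python
from itertools import accumulate


def one_hot_columns(rows, n_values):
    """Return the active one-hot column indices per row and the total width.

    Assumption: every level in column i is an integer in range(n_values[i]).
    """
    # Starting output column for each input column: [0, n0, n0 + n1, ...]
    offsets = [0] + list(accumulate(n_values))
    active = []
    for row in rows:
        cols = []
        for i, level in enumerate(row):
            if not 0 <= level < n_values[i]:
                raise ValueError(
                    f"level {level} in column {i} is outside range({n_values[i]})"
                )
            cols.append(offsets[i] + level)
        active.append(cols)
    return active, offsets[-1]


# Hypothetical toy data: column 0 uses levels {0, 5}, column 1 uses {0, 1, 2}.
rows = [[0, 1], [5, 2], [0, 0]]

# Column 0 has only 2 distinct levels, but its highest level is 5,
# so under this reading n_values must be 6 (max + 1), not 2.
n_values = [max(col) + 1 for col in zip(*rows)]
active, width = one_hot_columns(rows, n_values)

print(n_values)  # [6, 3]
print(width)     # 9
print(active)    # [[0, 7], [5, 8], [0, 6]]
```

If that reading is right, passing the number of distinct levels (2 for the first column above) would be rejected, because level 5 falls outside range(2).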
Cheers,
Sarah

On Sun, Feb 4, 2018 at 11:27 PM, Joel Nothman <joel.noth...@gmail.com> wrote:

> 20 million categories, or 20 million categorical variables?
>
> OneHotEncoder is pretty efficient if you specify n_values.
>
> On 5 February 2018 at 15:10, Sarah Wait Zaranek <sarah.zara...@gmail.com>
> wrote:
>
>> Hello -
>>
>> I was just wondering if there was a way to improve performance on the
>> one-hot encoder. Or, is there any plans to do so in the future? I am
>> working with a matrix that will ultimately have 20 million categorical
>> variables, and my bottleneck is the one-hot encoder.
>>
>> Let me know if this isn't the place to inquire. My code is very simple
>> when using the encoder, but I cut and pasted it here for completeness.
>>
>> enc = OneHotEncoder(sparse=True)
>> Xtrain = enc.fit_transform(tiledata)
>>
>>
>> Thanks,
>> Sarah
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> firstname.lastname@example.org
>> https://mail.python.org/mailman/listinfo/scikit-learn