Hi Sarah, I have some reflection questions. You don't need to answer all of them :) How many categories (approximately) do you have in each of those 20M categorical variables? How many samples do you have? Maybe you should consider different encoding strategies, such as binary encoding. Also, this looks like a big data problem. Have you considered using distributed computing? And do you really need to use all 20M variables in your first approach? Consider using feature selection techniques. I would suggest that you start with something simpler, with fewer features, that runs more easily on your machine. Then later you can start adding more complexity if necessary. Keep in mind that if the number of samples is lower than the number of columns after one-hot encoding, you might face overfitting. Try to always have fewer columns than samples.
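To illustrate the binary-encoding idea mentioned above: instead of one column per category, you can represent each integer category code in ceil(log2(n_categories)) binary columns. Here is a minimal NumPy sketch (the function name and example data are just illustrative, not from any particular library):

```python
import numpy as np

def binary_encode(codes, n_categories):
    """Encode integer category codes into ceil(log2(n_categories)) binary columns."""
    n_bits = max(1, int(np.ceil(np.log2(n_categories))))
    # Broadcast a right-shift over bit positions, then mask the lowest bit:
    # each column holds one bit of the category code.
    return (codes[:, None] >> np.arange(n_bits)) & 1

codes = np.array([0, 3, 5, 7])   # 4 samples drawn from 8 possible categories
encoded = binary_encode(codes, 8)
print(encoded.shape)  # (4, 3): 3 columns instead of 8 for one-hot
```

So a variable with, say, one million categories needs only 20 binary columns instead of a million one-hot columns, at the cost of losing the "one column per category" interpretability.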
On Aug 2, 2018 12:53, "Sarah Wait Zaranek" <sarah.zara...@gmail.com> wrote:

Hi Joel -

Are you sure? I ran it and it actually uses a bit more memory instead of less, same code just run with a different docker container.

Max memory used by a single task: 50.41GB
vs
Max memory used by a single task: 51.15GB

Cheers,
Sarah

On Wed, Aug 1, 2018 at 7:19 PM, Sarah Wait Zaranek <sarah.zara...@gmail.com> wrote:
> In the developer version, yes? Looking for the new memory savings :)
>
> On Wed, Aug 1, 2018, 17:29 Joel Nothman <joel.noth...@gmail.com> wrote:
>> Use OneHotEncoder

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn