On Tue, Nov 20, 2018 at 04:06:30PM -0500, Andreas Mueller wrote:
> I would love to see the TargetEncoder ported to scikit-learn.
> The CountFeaturizer is pretty stalled:
> https://github.com/scikit-learn/scikit-learn/pull/9614

So would I. But there are several ways of doing it:

- the naive way is not the right one: just computing the average of y
  for each category leads to overfitting quite fast

- it can be done cross-validated, splitting the train data, in a
  "cross-fit" strategy (see
  https://github.com/dirty-cat/dirty_cat/issues/53)

- it can be done using empirical-Bayes shrinkage, which is what we
  currently do in dirty_cat.

We are planning to do heavy benchmarking of those strategies, to figure
out the tradeoffs. But we won't get to it before February, I am afraid.

> Have you benchmarked the other encoders in the category_encoding lib?
> I would be really curious to know when/how they help.

We did (part of the results are in the publication), and we didn't have
great success.

Gaël

> On 11/20/18 3:58 PM, Gael Varoquaux wrote:
> > Hi scikit-learn friends,

> > As you might have seen on twitter, my lab -with a few friends- has
> > embarked on research to ease machine learning on "dirty data". We are
> > experimenting on new encoding methods for non-curated string categories.
> > For this, we are developing a small software project called "dirty_cat":
> > https://dirty-cat.github.io/stable/

> > dirty_cat is a test bed for new ideas of "dirty categories". It is a
> > research project, though we still try to do decent software engineering
> > :). Rather than contributing to existing codebases (as the great
> > categorical-encoding project in scikit-learn-contrib), we spun it out
> > in a separate software project to have the freedom to try out ideas that
> > we might give up after gaining insight.

> > We hope that it is a useful tool: if you have non-curated string
> > categories, please give it a try. Understanding what works and what does
> > not is important to know what to consolidate. Hopefully one day we can
> > develop a tool that is of wide-enough interest that it can go in
> > scikit-learn-contrib, or maybe even scikit-learn.

> > Also, if you have suggestions of publicly available databases that we try
> > it upon, we would love to hear from you.

> > Cheers,

> > Gaël

> > PS: if you want to work on dirty-data problems in Paris as a post-doc or
> > an engineer, send me a line

> > _______________________________________________
> > scikit-learn mailing list
> > [email protected]
> > https://mail.python.org/mailman/listinfo/scikit-learn

-- 
    Gael Varoquaux
    Senior Researcher, INRIA Parietal
    NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
    Phone:  ++ 33-1-69-08-79-68
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux
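
For readers of the archive, the three target-encoding strategies listed in the reply above (naive per-category mean, cross-fit, shrinkage) can be sketched roughly as follows. This is an illustration only, not the dirty_cat or scikit-learn implementation: the function names, the fold assignment, and the smoothing strength `m` are all hypothetical, and the shrinkage shown is a simple fixed-strength one (an m-estimate) standing in for a proper empirical-Bayes estimate, which would fit the shrinkage strength from the data.

```python
import numpy as np


def naive_target_encode(cats, y):
    """Replace each category by the mean of y over that category.

    Every sample contributes to its own encoding, so rare categories
    leak their target value (the overfitting mentioned above).
    """
    cats = np.asarray(cats)
    y = np.asarray(y, dtype=float)
    means = {c: y[cats == c].mean() for c in np.unique(cats)}
    return np.array([means[c] for c in cats])


def cross_fit_target_encode(cats, y, n_splits=5, seed=0):
    """Encode each sample with category means computed on the other folds.

    Assumes 2 <= n_splits <= len(y). Categories unseen in the training
    folds fall back to the training-fold mean of y.
    """
    cats = np.asarray(cats)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Balanced random fold assignment (hypothetical choice for this sketch).
    folds = rng.permutation(len(y)) % n_splits
    out = np.empty(len(y))
    for k in range(n_splits):
        test = folds == k
        train = ~test
        prior = y[train].mean()
        means = {
            c: y[train][cats[train] == c].mean()
            for c in np.unique(cats[train])
        }
        out[test] = [means.get(c, prior) for c in cats[test]]
    return out


def shrunk_target_encode(cats, y, m=10.0):
    """Shrink each category mean toward the global mean of y.

    m acts as a pseudo-count: a category with n samples gets weight
    n / (n + m) on its own mean. m is a hypothetical parameter here.
    """
    cats = np.asarray(cats)
    y = np.asarray(y, dtype=float)
    prior = y.mean()
    out = np.empty(len(y))
    for c in np.unique(cats):
        mask = cats == c
        n = mask.sum()
        out[mask] = (y[mask].sum() + m * prior) / (n + m)
    return out
```

On a category that appears only once, the naive encoder returns that sample's own target value, while the cross-fit variant never sees it and the shrunk variant pulls it toward the global mean; that difference is the tradeoff the planned benchmarks are meant to quantify.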
