On Tue, Nov 20, 2018 at 04:35:43PM -0500, Andreas Mueller wrote:
> > - it can be done cross-validated, splitting the train data, in a
> > "cross-fit" strategy
> > (see https://github.com/dirty-cat/dirty_cat/issues/53)
> This is called leave-one-out in the category_encoders library, I think,
> and that's what my first implementation would be.
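To make the leave-one-out variant concrete, here is a rough standalone
sketch (my own illustration, not dirty_cat's or category_encoders'
actual code): each row's category is replaced by the mean target of the
*other* rows in that category, so the row's own label does not leak
into its feature.

```python
# Hypothetical leave-one-out target encoder, for illustration only.
import numpy as np

def leave_one_out_encode(categories, y):
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    global_mean = y.mean()
    encoded = np.empty_like(y)
    for c in np.unique(categories):
        mask = categories == c
        n = mask.sum()
        if n == 1:
            # A singleton category has no "other" rows to average over,
            # so fall back to the global target mean.
            encoded[mask] = global_mean
        else:
            total = y[mask].sum()
            # For each row, subtract its own target before averaging.
            encoded[mask] = (total - y[mask]) / (n - 1)
    return encoded
```

The full cross-fit strategy generalizes this from leaving out one row
to leaving out a whole CV fold at a time.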
> > - it can be done using empirical-Bayes shrinkage, which is what we
> > currently do in dirty_cat.
> Reference / explanation?

I think that a good reference is the prior-art section of our paper:
https://arxiv.org/abs/1806.00979

But we found the following reference helpful:

Micci-Barreca, D.: A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction problems.
ACM SIGKDD Explorations Newsletter 3(1), 27–32 (2001)

> > We are planning to do heavy benchmarking of those strategies, to
> > figure out the tradeoffs. But we won't get to it before February,
> > I am afraid.
> aww ;)

Yeah. I do slow science. Slow everything, actually :(.

Gaël
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
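P.S. A minimal sketch of the shrinkage idea, in the spirit of
Micci-Barreca (2001): each category's target mean is pulled toward the
global mean, with less shrinkage for well-populated categories. This is
an illustration only, not the dirty_cat implementation, and `m` (the
smoothing strength) is a hypothetical knob, not an actual parameter
name.

```python
# Hypothetical shrinkage target encoder, for illustration only.
import numpy as np

def shrinkage_encode(categories, y, m=10.0):
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    global_mean = y.mean()
    encoding = {}
    for c in np.unique(categories):
        mask = categories == c
        n = mask.sum()
        # Weight on the category's own mean grows with its count n;
        # rare categories are pulled toward the global mean.
        lam = n / (n + m)
        encoding[c] = lam * y[mask].mean() + (1 - lam) * global_mean
    return encoding
```

Micci-Barreca's paper actually proposes a sigmoid weighting
lambda(n) = 1 / (1 + exp(-(n - k) / f)); the ratio n / (n + m) above is
a common simpler variant of the same shrinkage idea.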