On Tue, Nov 20, 2018 at 04:35:43PM -0500, Andreas Mueller wrote:
> > - it can be done cross-validated, splitting the train data, in a
> > "cross-fit" strategy
> > (see https://github.com/dirty-cat/dirty_cat/issues/53)
> This is called leave-one-out in the category_encoders library, I think,
> and that's what my first implementation would be.
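To make the leave-one-out variant concrete, here is a rough standalone
sketch (my own illustration, not dirty_cat's or category_encoders'
actual code): each row's category is replaced by the mean target of the
*other* rows in that category, so the row's own label does not leak
into its feature.

```python
# Hypothetical leave-one-out target encoder, for illustration only.
import numpy as np

def leave_one_out_encode(categories, y):
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    global_mean = y.mean()
    encoded = np.empty_like(y)
    for c in np.unique(categories):
        mask = categories == c
        n = mask.sum()
        if n == 1:
            # A singleton category has no "other" rows to average over,
            # so fall back to the global target mean.
            encoded[mask] = global_mean
        else:
            total = y[mask].sum()
            # For each row, subtract its own target before averaging.
            encoded[mask] = (total - y[mask]) / (n - 1)
    return encoded
```

The full cross-fit strategy generalizes this from leaving out one row
to leaving out a whole CV fold at a time.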
> > - it can be done using empirical-Bayes shrinkage, which is what we
> > currently do in dirty_cat.
> Reference / explanation?

I think that a good reference is the prior-art section of our paper:
https://arxiv.org/abs/1806.00979

But we found the following reference helpful:

Micci-Barreca, D.: A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction problems.
ACM SIGKDD Explorations Newsletter 3(1), 27–32 (2001)

> > We are planning to do heavy benchmarking of those strategies, to
> > figure out the tradeoffs. But we won't get to it before February,
> > I am afraid.
> aww ;)

Yeah. I do slow science. Slow everything, actually :(.

Gaël
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
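P.S. A minimal sketch of the shrinkage idea, in the spirit of
Micci-Barreca (2001): each category's target mean is pulled toward the
global mean, with less shrinkage for well-populated categories. This is
an illustration only, not the dirty_cat implementation, and `m` (the
smoothing strength) is a hypothetical knob, not an actual parameter
name.

```python
# Hypothetical shrinkage target encoder, for illustration only.
import numpy as np

def shrinkage_encode(categories, y, m=10.0):
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    global_mean = y.mean()
    encoding = {}
    for c in np.unique(categories):
        mask = categories == c
        n = mask.sum()
        # Weight on the category's own mean grows with its count n;
        # rare categories are pulled toward the global mean.
        lam = n / (n + m)
        encoding[c] = lam * y[mask].mean() + (1 - lam) * global_mean
    return encoding
```

Micci-Barreca's paper actually proposes a sigmoid weighting
lambda(n) = 1 / (1 + exp(-(n - k) / f)); the ratio n / (n + m) above is
a common simpler variant of the same shrinkage idea.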