On 11/21/18 10:34 AM, Gael Varoquaux wrote:
> Joris has just accepted to help with benchmarking, so we can have
> preliminary results sooner. The question really is: out of the
> different variants that exist, which one should we choose? I think
> that is a legitimate question that arises on many of our PRs.
Thanks Joris! I could also ask Jan to help ;)
The question for this particular issue, for me, is also "what are good
benchmark datasets?" It's a somewhat different task than what you're
benchmarking with dirty_cat, right? In dirty_cat you used dirty
categories, which are a subset of all high-cardinality categorical
variables. Whether "clean" high-cardinality variables like zip codes or
dirty ones make the better benchmark is a bit unclear to me, and I'm
not aware of a wealth of datasets for either :-/
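For concreteness, here is a minimal sketch of the kind of benchmark I
have in mind, on a synthetic "clean" high-cardinality column. The
column name and data are made up for illustration, and the
TargetEncoder assumes the third-party category_encoders package, just a
stand-in for whichever encoder variant is under discussion:

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder  # assumed third-party package

rng = np.random.RandomState(0)
n_samples, n_categories = 5000, 500  # many levels, few samples per level
codes = rng.randint(n_categories, size=n_samples)
level_effect = rng.randn(n_categories)

# Hypothetical "clean" high-cardinality column, zip-code style.
X = pd.DataFrame({"zip_code": codes.astype(str)})
y = level_effect[codes] + 0.5 * rng.randn(n_samples)

encoders = {
    "one-hot": OneHotEncoder(handle_unknown="ignore"),
    "target": TargetEncoder(cols=["zip_code"]),
}
for name, encoder in encoders.items():
    # Encoding happens inside the pipeline, so the target encoder only
    # sees the training fold of each split and cannot leak the target.
    pipeline = make_pipeline(encoder, Ridge())
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name:8s} R^2: {scores.mean():.3f} +/- {scores.std():.3f}")

Real benchmark datasets would of course replace the synthetic data
here; that is exactly the open question above.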
> But in general, I don't think that we should rush things because of
> deadlines. The consequence of rushing is that we have to change things
> after the merge, which is more work. I know that it is slow, but we
> are quite a central package.
I agree.