On Wed, Nov 21, 2018 at 11:35:11AM -0500, Andreas Mueller wrote:
> The question for this particular issue for me is also "what are good
> benchmark datasets".
> In dirty cat you used dirty categories, which is a subset of all
> high-cardinality categorical variables.
> Whether "clean" high cardinality variables like zip-codes or dirty ones
> are the better benchmark is a bit unclear to me, and I'm not aware of a
> wealth of datasets for either :-/
Fair point. We'll have a look to see what we can find. We're open to
suggestions, from you or from anyone else.

G
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn