On Wed, Nov 21, 2018 at 09:47:13AM -0500, Andreas Mueller wrote:
> The PR is over a year old already, and you hadn't voiced any opposition
> there.
My bad, sorry. Given the name, I had not guessed the link between the PR and the encoding of categorical features.

I find myself very much in agreement with the original issue and its discussion, https://github.com/scikit-learn/scikit-learn/issues/5853: the concerns about the name, and the importance of at least considering prior smoothing (a sketch of what prior smoothing looks like is appended after this message). I do not see these reflected in the PR.

In general, the fact that there is not much literature on this implies that we should be benchmarking our choices. The more I understand Kaggle, the less I think that we can fully use it as an inclusion argument: people do transforms that end up being very specific to one challenge. On the specific problem of categorical encoding, we have tried to do a systematic analysis of some of these, and were not very successful empirically (e.g. hashing encoding).

This is not at all a vote against target encoding, which our benchmarks showed was very useful, but just a push for benchmarking PRs, in particular when they do not correspond to well-cited work (which is our standard inclusion criterion). Joris has just agreed to help with benchmarking, so we can have preliminary results soon.

The question really is: out of the different variants that exist, which one should we choose? I think that it is a legitimate question, and one that arises on many of our PRs. But in general, I don't think that we should rush things because of deadlines. The consequence of rushing is that we need to change things after the merge, which is more work. I know that it is slow, but we are quite a central package.

Gaël

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
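[Editorial note, not part of the original message: "prior smoothing" here usually means shrinking each category's target mean toward the global target mean, so that rare categories are not encoded by a noisy per-category estimate. Below is a minimal illustrative sketch of one common variant in pandas; it is not the PR's implementation, and the function name smoothed_target_encode and the smoothing weight m are made up for illustration.]

    import pandas as pd

    def smoothed_target_encode(categories, target, m=10.0):
        # Hypothetical helper, not scikit-learn API: encode each category
        # by a weighted blend of its per-category target mean and the
        # global target mean, controlled by the smoothing weight `m`.
        df = pd.DataFrame({"cat": categories, "y": target})
        global_mean = df["y"].mean()
        stats = df.groupby("cat")["y"].agg(["mean", "count"])
        # Categories with few observations are pulled more strongly
        # toward the global mean (the "prior").
        smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (
            stats["count"] + m
        )
        return df["cat"].map(smoothed)

    # Example usage:
    # smoothed_target_encode(["a", "a", "b", "c", "c", "c"],
    #                        [1, 0, 1, 0, 0, 1], m=2.0)

[The choice of variant and of a default for m is exactly the kind of decision the benchmarking discussed above would inform.]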