On 12/15/18 7:35 AM, Joris Van den Bossche wrote:
On Fri, Dec 14, 2018 at 16:46, Andreas Mueller <t3k...@gmail.com> wrote:

    As far as I understand, the open PR is not a leave-one-out
    TargetEncoder?
    I would want it to be :-/
    I also did not yet add the CountFeaturizer from that scikit-learn
    PR, because it is actually quite different (e.g. it doesn't work
    for regression tasks, as it counts conditional on y). But for
    classification it could be easily added to the benchmarks.
    I'm confused now. That's what TargetEncoder and leave-one-out
    TargetEncoder do as well, right?


As far as I understand, that is not exactly what those do. The TargetEncoder (as implemented in dirty_cat, category_encoders and hccEncoders) will, for each category, calculate the expected value of the target given that category. For binary classification this indeed comes down to counting the 0's and 1's, and there the information contained in the result may be similar to that of the sklearn PR, but the format is different: those packages calculate the probability (a value between 0 and 1, the number of 1's divided by the number of samples in that category) and return that as a single column, instead of returning two columns with the counts of the 0's and 1's.
This is the standard "binary special case", right? For multi-class you need multiple columns, right?
Doing a single column for binary makes sense, I think.
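
To make the difference in output format concrete, here is a minimal sketch on toy data (the column names and values are made up, not taken from the PR): a TargetEncoder-style single probability column versus CountFeaturizer-style per-class count columns.

import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "A", "B", "B"],
                   "y":    [1,   0,   1,   0,   0]})

# TargetEncoder-style: one column with the mean of y per category,
# i.e. P(y=1 | category) for a binary target
target_encoded = df.groupby("city")["y"].mean()
print(target_encoded)   # A -> 0.667, B -> 0.0

# CountFeaturizer-style: one column per class with the raw counts
class_counts = pd.crosstab(df["city"], df["y"])
print(class_counts)     # counts of y=0 and y=1 per category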

And for regression this is no longer about counting, but just the average of the target per category (in practice, the TargetEncoder computes the same thing for regression and binary classification: the average of the target per category. But for regression the CountFeaturizer doesn't work, since there are no discrete values in the target to count).
I guess CountFeaturizer was not implemented with regression in mind.
Actually, being able to do regression and classification in the same estimator shows that "CountFeaturizer"
is probably the wrong name.
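
For the regression case, the same per-category mean still applies while counting classes has no analogue; a toy sketch (made-up data):

import pandas as pd

df = pd.DataFrame({"city":  ["A", "A", "B", "B"],
                   "price": [200.0, 250.0, 90.0, 110.0]})

# Target encoding for a continuous target: just the per-category mean
encoding = df.groupby("city")["price"].mean()   # A -> 225.0, B -> 100.0
df["city_encoded"] = df["city"].map(encoding)
# Nothing comparable to "counts per class" exists here: the target has
# no discrete values to count.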


Furthermore, all of those implementations in the three mentioned packages have some kind of regularization (empirical Bayes shrinkage, or KFold or leave-one-out cross-validation), which is also not present in the CountFeaturizer PR (but this aspect is of course something we want to actually test in the benchmarks).
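
As a rough illustration of one of those regularization schemes (not any package's actual API, just a sketch), a leave-one-out target encoding of the training rows could look like this:

import numpy as np
import pandas as pd

def leave_one_out_encode(categories, y):
    """Encode each training row with the mean target of its category,
    excluding the row itself, to limit target leakage."""
    y = pd.Series(y).reset_index(drop=True)
    categories = pd.Series(categories).reset_index(drop=True)
    sums = y.groupby(categories).transform("sum")
    counts = y.groupby(categories).transform("count")
    prior = y.mean()
    # (sum of y in the category - the row's own y) / (count - 1);
    # fall back to the global mean for categories seen only once
    denom = (counts - 1).replace(0, 1)
    encoded = np.where(counts > 1, (sums - y) / denom, prior)
    return pd.Series(encoded)

df = pd.DataFrame({"city": ["A", "A", "A", "B"], "y": [1, 0, 1, 1]})
print(leave_one_out_encode(df["city"], df["y"]))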

Another thing I noticed in the CountFeaturizer implementation is that the behaviour differs depending on whether y is passed or not. First, I find that a bit strange, since the two behaviours are quite different (counting the categories, i.e. just encoding the categorical variable with a notion of its frequency in the training set, versus counting the target conditional on the category). But also, when using a transformer in a Pipeline, you don't control the passing of y, I think? So in that way you always get the behaviour of counting the target. I would find it more logical to have those two things in two separate transformers (if we think the "frequency encoder" is useful enough). (I need to give this feedback on the PR, but that will be for after the holidays.)

I'm pretty sure I mentioned that before; I think optional y is bad. I just thought it was weird, but the pipeline argument is a good one.
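
A quick way to check the pipeline point above (with a made-up reporting transformer): inside a Pipeline, y is forwarded to every step's fit, so the user cannot opt out of the y-dependent behaviour.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ReportsY(BaseEstimator, TransformerMixin):
    """Toy transformer that only reports whether it received y in fit."""
    def fit(self, X, y=None):
        print("y received by transformer:", y is not None)
        return self
    def transform(self, X):
        return X

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("enc", ReportsY()), ("clf", LogisticRegression())])
pipe.fit(X, y)   # prints "y received by transformer: True"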