On 12/15/18 7:35 AM, Joris Van den Bossche wrote:
On Fri, Dec 14, 2018 at 16:46, Andreas Mueller <t3k...@gmail.com> wrote:

    As far as I understand, the open PR is not a leave-one-out
    TargetEncoder?
    I would want it to be :-/
    I also did not yet add the CountFeaturizer from that scikit-learn
    PR, because it is actually quite different (e.g. it doesn't work
    for regression tasks, as it counts conditional on y). But for
    classification it could be easily added to the benchmarks.
    I'm confused now. That's what TargetEncoder and leave-one-out
    TargetEncoder do as well, right?


As far as I understand, that is not exactly what those do. The TargetEncoder (as implemented in dirty_cat, category_encoders and hccEncoders) will, for each category, calculate the expected value of the target given that category. For binary classification this indeed comes down to counting the 0's and 1's, and there the information contained in the result may be similar to that of the sklearn PR, but the format is different: those packages calculate the probability (a value between 0 and 1, the number of 1's divided by the number of samples in that category) and return that as a single column, instead of returning two columns with the counts of the 0's and 1's.
This is the standard "binary special case", right? For multi-class you need multiple columns, right?
Doing a single column for binary makes sense, I think.
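
To make the difference in output format concrete, here is a minimal sketch on toy data (the column names and values are made up, not taken from the PR): a TargetEncoder-style single probability column versus CountFeaturizer-style per-class count columns.

import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "A", "B", "B"],
                   "y":    [1,   0,   1,   0,   0]})

# TargetEncoder-style: one column with the mean of y per category,
# i.e. P(y=1 | category) for a binary target
target_encoded = df.groupby("city")["y"].mean()
print(target_encoded)   # A -> 0.667, B -> 0.0

# CountFeaturizer-style: one column per class with the raw counts
class_counts = pd.crosstab(df["city"], df["y"])
print(class_counts)     # counts of y=0 and y=1 per category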

And for regression this is no longer about counting, but just the average of the target per category (in practice, the TargetEncoder computes the same thing for regression and binary classification: the average of the target per category. But for regression the CountFeaturizer doesn't work, since there are no discrete values in the target to count).
I guess CountFeaturizer was not implemented with regression in mind.
Actually, being able to do regression and classification in the same estimator shows that "CountFeaturizer"
is probably the wrong name.
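
For the regression case, the same per-category mean still applies while counting classes has no analogue; a toy sketch (made-up data):

import pandas as pd

df = pd.DataFrame({"city":  ["A", "A", "B", "B"],
                   "price": [200.0, 250.0, 90.0, 110.0]})

# Target encoding for a continuous target: just the per-category mean
encoding = df.groupby("city")["price"].mean()   # A -> 225.0, B -> 100.0
df["city_encoded"] = df["city"].map(encoding)
# Nothing comparable to "counts per class" exists here: the target has
# no discrete values to count.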


Furthermore, all of those implementations in the three mentioned packages have some kind of regularization (empirical Bayes shrinkage, or KFold or leave-one-out cross-validation), which is also not present in the CountFeaturizer PR (but this aspect is of course something we want to actually test in the benchmarks).
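
As a rough illustration of one of those regularization schemes (not any package's actual API, just a sketch), a leave-one-out target encoding of the training rows could look like this:

import numpy as np
import pandas as pd

def leave_one_out_encode(categories, y):
    """Encode each training row with the mean target of its category,
    excluding the row itself, to limit target leakage."""
    y = pd.Series(y).reset_index(drop=True)
    categories = pd.Series(categories).reset_index(drop=True)
    sums = y.groupby(categories).transform("sum")
    counts = y.groupby(categories).transform("count")
    prior = y.mean()
    # (sum of y in the category - the row's own y) / (count - 1);
    # fall back to the global mean for categories seen only once
    denom = (counts - 1).replace(0, 1)
    encoded = np.where(counts > 1, (sums - y) / denom, prior)
    return pd.Series(encoded)

df = pd.DataFrame({"city": ["A", "A", "A", "B"], "y": [1, 0, 1, 1]})
print(leave_one_out_encode(df["city"], df["y"]))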

Another thing I noticed in the CountFeaturizer implementation is that the behaviour differs depending on whether y is passed or not. First, I find that a bit strange, since the two behaviours are quite different (counting the categories, i.e. just encoding the categorical variable with a notion of its frequency in the training set, versus counting the target conditional on the category). But also, when using a transformer in a Pipeline, you don't control the passing of y, I think? So in that way you always get the behaviour of counting the target. I would find it more logical to have those two things in two separate transformers (if we think the "frequency encoder" is useful enough). (I need to give this feedback on the PR, but that will be for after the holidays.)

I'm pretty sure I mentioned that before; I think optional y is bad. I just thought it was weird, but the pipeline argument is a good one.
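
A quick way to check the pipeline point above (with a made-up reporting transformer): inside a Pipeline, y is forwarded to every step's fit, so the user cannot opt out of the y-dependent behaviour.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ReportsY(BaseEstimator, TransformerMixin):
    """Toy transformer that only reports whether it received y in fit."""
    def fit(self, X, y=None):
        print("y received by transformer:", y is not None)
        return self
    def transform(self, X):
        return X

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("enc", ReportsY()), ("clf", LogisticRegression())])
pipe.fit(X, y)   # prints "y received by transformer: True"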