On 12/15/18 7:35 AM, Joris Van den Bossche wrote:
On Fri, Dec 14, 2018 at 16:46, Andreas Mueller <t3k...@gmail.com> wrote:
As far as I understand, the open PR is not a leave-one-out
TargetEncoder?
I would want it to be :-/
I also did not yet add the CountFeaturizer from that scikit-learn
PR, because it is actually quite different (e.g. it doesn't work
for regression tasks, as it counts conditional on y). But for
classification it could easily be added to the benchmarks.
I'm confused now. That's what TargetEncoder and leave-one-out
TargetEncoder do as well, right?
As far as I understand, that is not exactly what those do. The
TargetEncoder (as implemented in dirty_cat, category_encoders and
hccEncoders) calculates, for each category, the expected value of
the target given the category. For binary classification this
indeed comes down to counting the 0's and 1's, so the information
contained in the result might be similar to the sklearn PR, but the
format is different: those packages calculate the probability (a value
between 0 and 1, the number of 1's divided by the number of samples in
that category) and return it as a single column, instead of returning
two columns with the counts of the 0's and 1's.
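To make the difference concrete, here is a minimal sketch with pandas
(made-up data; this mirrors neither package's actual API):

    import pandas as pd

    df = pd.DataFrame({"cat": ["a", "a", "a", "b", "b"],
                       "y":   [1, 1, 0, 0, 0]})

    # TargetEncoder-style: a single column with the mean of y per
    # category (for binary y, the empirical probability of class 1).
    target_enc = df.groupby("cat")["y"].mean()
    # a -> 0.667, b -> 0.0

    # CountFeaturizer-style (as I understand the PR): one column per
    # class, holding the count of that class within the category.
    class_counts = pd.crosstab(df["cat"], df["y"])
    # y    0    1
    # cat
    # a    1    2
    # b    2    0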
This is a standard case of the "binary special case", right? For
multi-class you need multiple columns, right?
Doing a single column for binary makes sense, I think.
And for regression this is not related to counting anymore, but is
just the average of the target per category. (In practice, the
TargetEncoder computes the same thing for regression and binary
classification: the average of the target per category. But for
regression, the CountFeaturizer doesn't work, since there are no
discrete values in the target to count.)
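Continuing the sketch above with a continuous target (again made-up
data): the target-encoding recipe is unchanged, while per-class counts
stop making sense.

    df_reg = pd.DataFrame({"cat": ["a", "a", "b", "b"],
                           "y":   [1.2, 3.4, 0.5, 0.7]})

    # Same computation as for binary classification:
    reg_enc = df_reg.groupby("cat")["y"].mean()
    # a -> 2.3, b -> 0.6

    # Counting distinct y values per category is useless here: each
    # continuous value is (almost surely) unique in the training set.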
I guess CountFeaturizer was not implemented with regression in mind.
Actually, being able to do regression and classification in the same
estimator shows that "CountFeaturizer" is probably the wrong name.
Furthermore, all of the implementations in the three packages
mentioned have some kind of regularization (empirical Bayes shrinkage,
or KFold or leave-one-out cross-validation), while this is not present
in the CountFeaturizer PR either (but this aspect is of course
something we want to actually test in the benchmarks).
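For reference, a minimal sketch of the leave-one-out variant (plain
pandas, not the actual code of any of those packages): each training
sample is encoded with the target mean of its category computed over
all *other* samples, so a sample's own target never leaks into its
feature.

    import pandas as pd

    def loo_target_encode(cat, y):
        """Leave-one-out target mean per training sample."""
        df = pd.DataFrame({"cat": cat, "y": y})
        grp = df.groupby("cat")["y"]
        sums = grp.transform("sum")
        counts = grp.transform("count")
        # Subtract the sample's own target before averaging.
        # (Singleton categories give NaN here; real implementations
        # fall back to the global mean or add shrinkage.)
        return (sums - df["y"]) / (counts - 1)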
Another thing I noticed in the CountFeaturizer implementation is that
the behaviour differs depending on whether y is passed or not. First,
I find this a bit strange, because the two behaviours are quite
different: counting the categories (to just encode the categorical
variable with a notion of its frequency in the training set) versus
counting the target conditional on the category. But also, when using
a transformer in a Pipeline, you don't control the passing of y, I
think? So in that setting you always get the behaviour of counting the
target.
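A toy illustration of that Pipeline point (a sketch; the transformer
and its name are made up, standing in for any transformer with an
optional y): Pipeline.fit(X, y) forwards y to every step's fit, so the
y-less behaviour is unreachable inside a Pipeline.

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    class YAwareTransformer(BaseEstimator, TransformerMixin):
        """Toy transformer that reports whether fit received y."""
        def fit(self, X, y=None):
            print("y passed to fit:", y is not None)
            return self

        def transform(self, X):
            return X

    X = np.arange(8, dtype=float).reshape(4, 2)
    y = np.array([0, 1, 0, 1])

    pipe = Pipeline([("enc", YAwareTransformer()),
                     ("clf", LogisticRegression())])
    pipe.fit(X, y)   # prints "y passed to fit: True"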
I would find it more logical to have those two things in two separate
transformers (if we think the "frequency encoder" is useful enough).
(I need to give this feedback on the PR, but that will be for after
the holidays.)
I'm pretty sure I mentioned that before; I think an optional y is bad.
I just thought it was weird, but the pipeline argument is a good one.