On 12/13/18 4:16 AM, Joris Van den Bossche wrote:
> Hi all,
> I finally had some time to start looking at it over the last few days.
> Some preliminary work can be found here:
> https://github.com/jorisvandenbossche/target-encoder-benchmarks.
You continue to be my hero. I probably can't look at it in detail before
the holidays, though :-/
> Up to now, I have only done some preliminary work to set up the
> benchmarks (based on Patricio Cerda's code,
> https://arxiv.org/pdf/1806.00979.pdf) and, with some initial datasets
> (medical charges and employee salaries), compared the different
> implementations with their default settings.
> So there is still a lot to do (add datasets, investigate the actual
> differences between the implementations and their results, compare the
> options in a more structured way, etc.; some TODOs are listed in the
> README). However, I am mostly on holiday for the rest of December. If
> somebody wants to look at it further, that is certainly welcome;
> otherwise, it will be a priority for me at the beginning of January.
> For datasets: additional ideas are welcome. For now, the idea is to
> add a subset of the Criteo Terabyte Click dataset, and to generate
> some synthetic data.
>>> Does that mean you'd be opposed to adding the leave-one-out
>>> TargetEncoder? I would really like to add it before February.
>> A few months to get it right is not that bad, is it?
> The PR is over a year old already, and you hadn't voiced any opposition
> there.
As far as I understand, the open PR is not a leave-one-out TargetEncoder?
I would want it to be :-/
> I also did not yet add the CountFeaturizer from that scikit-learn PR,
> because it is actually quite different (e.g. it doesn't work for
> regression tasks, as it counts conditional on y). But for
> classification it could easily be added to the benchmarks.
I'm confused now. That's what TargetEncoder and leave-one-out
TargetEncoder do as well, right?
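For readers following along: the distinction being discussed can be sketched in a few lines. A plain target encoder replaces each category with the mean of y over all rows in that category, while the leave-one-out variant excludes the current row from its own category mean, which reduces target leakage at fit time. This is only an illustrative sketch, not the API of any of the implementations under benchmark; the function names and the global-mean fallback for categories seen only once are my own choices.

```python
import pandas as pd


def target_encode(cat, y):
    """Plain target encoding: map each category to the mean of y."""
    means = pd.Series(y).groupby(cat).mean()
    return pd.Series(cat).map(means).to_numpy()


def leave_one_out_encode(cat, y):
    """Leave-one-out target encoding: for row i, use the mean of y
    over the *other* rows sharing its category, i.e. (sum - y_i) / (n - 1)."""
    cat = pd.Series(cat)
    y = pd.Series(y, index=cat.index)
    sums = y.groupby(cat).transform("sum")      # per-row: sum of y in its category
    counts = cat.groupby(cat).transform("count")  # per-row: size of its category
    loo = (sums - y) / (counts - 1)
    # Categories seen only once yield 0/0 = NaN; fall back to the global mean
    # (an arbitrary but common choice for this sketch).
    return loo.fillna(y.mean()).to_numpy()
```

With `cat = ["a", "a", "b", "b", "b"]` and `y = [1, 3, 2, 4, 6]`, the plain encoder gives every "a" row 2.0 and every "b" row 4.0, whereas the leave-one-out encoder gives each row a different value depending on its own target.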
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn