I would love to see the TargetEncoder ported to scikit-learn.
The CountFeaturizer is pretty stalled:
https://github.com/scikit-learn/scikit-learn/pull/9614

:-/

Have you benchmarked the other encoders in the category_encoding lib?
I would be really curious to know when/how they help.


On 11/20/18 3:58 PM, Gael Varoquaux wrote:
Hi scikit-learn friends,

As you might have seen on Twitter, my lab -with a few friends- has
embarked on research to ease machine learning on "dirty data". We are
experimenting with new encoding methods for non-curated string categories.
For this, we are developing a small software project called "dirty_cat":
https://dirty-cat.github.io/stable/

dirty_cat is a test bed for new ideas on "dirty categories". It is a
research project, though we still try to do decent software engineering
:). Rather than contributing to existing codebases (such as the great
categorical-encoding project in scikit-learn-contrib), we spun it out
into a separate software project to have the freedom to try out ideas that
we might give up after gaining insight.

We hope that it is a useful tool: if you have non-curated string
categories, please give it a try. Understanding what works and what does
not is important to know what to consolidate. Hopefully one day we can
develop a tool that is of wide-enough interest that it can go in
scikit-learn-contrib, or maybe even scikit-learn.

Also, if you have suggestions of publicly available databases that we
could try it on, we would love to hear from you.

Cheers,

Gaël

PS: if you want to work on dirty-data problems in Paris as a post-doc or
an engineer, drop me a line.
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

