Hi scikit-learn friends,

As you might have seen on Twitter, my lab, with a few friends, has embarked on research to ease machine learning on "dirty data". We are experimenting with new encoding methods for non-curated string categories. For this, we are developing a small software project called "dirty_cat": https://dirty-cat.github.io/stable/

dirty_cat is a test bed for new ideas on "dirty categories". It is a research project, though we still try to do decent software engineering :). Rather than contributing to existing codebases (such as the great categorical-encoding project in scikit-learn-contrib), we spun it off into a separate software project to have the freedom to try out ideas that we might give up after gaining insight.

We hope that it is a useful tool: if you have non-curated string categories, please give it a try. Understanding what works and what does not is important to know what to consolidate. Hopefully one day we can develop a tool that is of wide-enough interest that it can go in scikit-learn-contrib, or maybe even scikit-learn.

Also, if you have suggestions of publicly available databases that we could try it on, we would love to hear from you.

Cheers,

Gaël

PS: if you want to work on dirty-data problems in Paris as a post-doc or an engineer, drop me a line.

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn