Dear scikit-learn community,

I would like to announce a new release of dirty-cat, which strives to facilitate machine learning on non-curated categories: it is robust to morphological variants, such as typos.
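To give a feel for what robustness to morphological variants can mean, here is a minimal sketch in plain Python. It assumes a character 3-gram Jaccard similarity, which is one common way to compare noisy strings. The helper names (`char_ngrams`, `ngram_similarity`, `similarity_encode`) and the example categories are made up for illustration; this is not dirty-cat's code or API.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its similarity to every reference category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical reference categories and noisy entries (illustration only).
prototypes = ["police officer", "firefighter"]
rows = similarity_encode(["police oficer", "fire fighter", "teacher"], prototypes)
```

Each row of `rows` is a numeric vector with one column per reference category, so a misspelled entry such as "police oficer" still lands close to "police officer" instead of becoming an unrelated one-hot column.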
The big new feature, which I think is of interest to many, is the "SuperVectorizer", which strives to readily vectorize a pandas dataframe:
https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#example-super-vectorizer

Of course, such an object is full of heuristics. We have tuned them empirically, but we expect more progress in the long term, as we build a bigger database of dataframes that are difficult to vectorize.

We'd love people to join the adventure; it's been fun so far.

Cheers,
Gaël

-- 
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn