[scikit-learn] DirtyData and the SuperVectorizer, for non-normalized dataframes

Gael Varoquaux Wed, 13 Oct 2021 07:42:42 -0700

Dear scikit-learn community,

I would like to announce a new release of dirty-cat, which strives to
facilitate machine learning on non-curated categories: it is robust to
morphological variants, such as typos.
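
To give a feel for what robustness to typos can mean, here is a minimal
sketch in plain Python. It is not dirty-cat's actual code: it simply
compares two category strings through the character 3-grams they share
(Jaccard similarity), so that a misspelled entry stays close to the
correct one. The function names and the choice of 3-grams are
assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0 to 1)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Two strings too short to yield any n-gram are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A typo keeps most n-grams; an unrelated category shares very few.
typo_score = ngram_similarity("police officer", "polise officer")
other_score = ngram_similarity("police officer", "firefighter")
print(typo_score, other_score)
```

With such a score, "polise officer" lands much closer to "police
officer" than "firefighter" does, whereas a one-hot encoding would treat
all three as equally different.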


The big new feature, which I think is of interest to many, is the
"SuperVectorizer", which strives to readily vectorize a pandas dataframe:
https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#example-super-vectorizer
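
As a rough illustration of the kind of per-column decision such a
vectorizer has to make, here is a toy sketch in plain Python. It is not
the SuperVectorizer's logic: the function name, the threshold value and
the encoding labels are all hypothetical, and the table is made-up
example data.

```python
def choose_encoding(values, cardinality_threshold=10):
    """Pick a (hypothetical) encoding label for one column of raw values."""
    present = [v for v in values if v is not None]
    is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
    if present and all(is_number(v) for v in present):
        return "numeric: keep as is"
    # Count distinct entries; typos inflate this count for dirty columns.
    n_unique = len({str(v) for v in present})
    if n_unique <= cardinality_threshold:
        return "low-cardinality category: one-hot encode"
    return "high-cardinality category: typo-robust encoding"


# Made-up table, stored as a mapping from column name to list of values.
table = {
    "salary": [52000.0, 61000.0, None, 48000.0],
    "gender": ["F", "M", "F", "M"],
    "job_title": ["police officer", "polise officer", "firefighter", "nurse"],
}

# A tiny threshold is used here only because the example table is tiny.
plan = {name: choose_encoding(col, cardinality_threshold=2)
        for name, col in table.items()}
for name, decision in plan.items():
    print(f"{name}: {decision}")
```

The real object works on a pandas dataframe and returns an array ready
for a scikit-learn estimator; see the example linked above for actual
usage.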

Of course, such an object is full of heuristics. We have tuned them
empirically, but we expect more progress in the long term, as we build a
bigger database of dataframes that are difficult to vectorize. We'd love
people to join the adventure; it's been fun so far.

Cheers,

Gaël

-- 
    Gael Varoquaux
    Research Director, INRIA
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
