Hi Yacine,

If I'm understanding you correctly, I think what you have in mind is already implemented in scikit-learn in the TF-IDF vectorizer <http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>.
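
For instance, something along these lines should give term frequencies divided by document length. This is only a minimal sketch: the two toy documents are made up for illustration, and "document length" here means the number of tokens the vectorizer keeps with its default tokenization, which may not match the definition you have in mind. It relies on the `use_idf` and `norm` parameters of TfidfVectorizer, and compares the result with manually normalized CountVectorizer output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog barked",
]

# use_idf=False switches off the IDF weighting, and norm="l1" divides
# each row by the sum of its entries, i.e. by the document's token count.
tf_vectorizer = TfidfVectorizer(use_idf=False, norm="l1")
X_rel = tf_vectorizer.fit_transform(docs).toarray()

# The same quantity computed by hand from raw counts: counts / document length.
counts = CountVectorizer().fit_transform(docs).toarray()
X_manual = counts / counts.sum(axis=1, keepdims=True)

# Both approaches should agree, and every row should sum to 1.
print(np.allclose(X_rel, X_manual))
print(X_rel.sum(axis=1))
```

If that matches what you want, a new parameter on CountVectorizer may not be needed.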
Best,
Jake

Jake VanderPlas
Senior Data Science Fellow
Director of Open Software
University of Washington eScience Institute

On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.maz...@gmail.com> wrote:

> Hello,
>
> I would like to work on adding an additional feature to
> "sklearn.feature_extraction.text.CountVectorizer".
>
> In the current implementation, the definition of term frequency is the
> number of times a term t occurs in document d.
>
> However, another definition that is very commonly used in practice is the
> term frequency adjusted for document length
> <https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e.:
> tf = raw counts / document length.
>
> I intend to implement this by adding an additional boolean parameter
> "relative_frequency" to the constructor of CountVectorizer.
> If the parameter is true, normalize X by document length (along axis=1) in
> "CountVectorizer.fit_transform()".
>
> What do you think?
> If this sounds reasonable and worth it, I will send a PR.
>
> Thank you,
> Yacine.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn