Hello,

I would like to work on adding a feature to "sklearn.feature_extraction.text.CountVectorizer".

In the current implementation, the term frequency is defined as the number of times a term t occurs in document d. However, another definition that is commonly used in practice is the term frequency adjusted for document length <https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e. tf = raw count / document length.

I intend to implement this by adding a boolean parameter "relative_frequency" to the constructor of CountVectorizer. If the parameter is true, each row of X is divided by the document length (i.e. X is normalized along axis=1) in "CountVectorizer.fit_transform()".

What do you think? If this sounds reasonable and worth it, I will send a PR.

Thank you,
Yacine
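P.S. A minimal sketch of the intended behavior, using only the existing API. The helper name "relative_term_frequency" is hypothetical and only for illustration, and I am assuming here that "document length" means the number of tokens that CountVectorizer actually counts for the document (tokens dropped by the tokenizer or by stop-word filtering are not included). Because the counts are non-negative, L1-normalizing each row is the same as dividing it by its row sum.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize


def relative_term_frequency(docs):
    """Hypothetical helper: raw counts divided by document length.

    "Document length" is assumed to be the sum of the counted tokens
    in each row, not the raw character or word count of the text.
    """
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)  # sparse matrix of raw counts
    # Counts are non-negative, so the L1 norm of a row equals its sum.
    # Rows that are all zero (empty documents) are left unchanged.
    rel = normalize(counts, norm="l1", axis=1)
    return rel, vectorizer


docs = ["the cat sat on the mat", "the dog barked"]
rel_tf, vectorizer = relative_term_frequency(docs)

# Each non-empty row now sums to 1.
row_sums = np.asarray(rel_tf.sum(axis=1)).ravel()
```

The proposed "relative_frequency" parameter would essentially apply this last normalization step inside "fit_transform()" and "transform()".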
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn