[scikit-learn] CountVectorizer: Additional Feature Suggestion

Yacine MAZARI Sat, 27 Jan 2018 22:01:58 -0800

Hello,

I would like to work on adding an additional feature to
"sklearn.feature_extraction.text.CountVectorizer".


In the current implementation, the definition of term frequency is the
number of times a term t occurs in document d.

However, another definition that is very commonly used in practice is the term
frequency adjusted for document length
<https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e., tf =
raw count / document length.

I intend to implement this by adding a boolean parameter
"relative_frequency" to the constructor of CountVectorizer.
If the parameter is True, X would be normalized by document length (along
axis=1) in "CountVectorizer.fit_transform()".

What do you think?
If this sounds reasonable and worth it, I will send a PR.
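
For concreteness, here is a minimal sketch of the proposed normalization, done
outside CountVectorizer since the "relative_frequency" parameter does not exist
yet. It assumes that "document length" means the sum of the counts in each row
(i.e. only tokens kept in the vocabulary are counted); the example documents
are made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
docs = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of raw term counts

# Assumption: document length = total number of counted tokens per row.
doc_lengths = np.asarray(X.sum(axis=1)).ravel().astype(float)
doc_lengths[doc_lengths == 0] = 1.0  # leave empty documents as all-zero rows

# Scale each row by 1 / document length while keeping X sparse.
X_rel = (sparse.diags(1.0 / doc_lengths) @ X).tocsr()
```

Each non-empty row of X_rel then sums to 1. Because the counts are
non-negative, this should be equivalent to L1-normalizing the rows of X.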

Thank you,
Yacine.

