
I would like to work on adding an additional feature to

In the current implementation, the definition of term frequency is the
number of times a term t occurs in document d.

However, another definition that is very commonly used in practice is the term
frequency adjusted for document length
<https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e., tf =
raw count / document length.
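
To make the definition concrete, here is a minimal pure-Python sketch of the
length-adjusted term frequency. The function name relative_term_frequency and
the whitespace tokenization are assumptions made for illustration only; this
is not scikit-learn code.

```python
from collections import Counter


def relative_term_frequency(tokens):
    """Return {term: raw count / document length} for one tokenized document.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    doc_length = len(tokens)
    if doc_length == 0:
        # An empty document has no terms, so avoid dividing by zero.
        return {}
    counts = Counter(tokens)
    return {term: count / doc_length for term, count in counts.items()}


# Naive whitespace tokenization, assumed here only to keep the example short.
doc = "the cat sat on the mat".split()
tf = relative_term_frequency(doc)
print(tf["the"])  # "the" occurs 2 times in a 6-token document
```

On a document-term count matrix, the same adjustment amounts to dividing each
row by its sum. As far as I can tell, sklearn.preprocessing.normalize with
norm="l1" already does this for non-negative counts.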

I intend to implement this by adding an additional boolean parameter
"relative_frequency" to the constructor of CountVectorizer.
If the parameter is true, normalize X by document length (along axis=1) in

What do you think?
If this sounds reasonable and worth it, I will send a PR.

Thank you,
scikit-learn mailing list
