Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

Jacob Vanderplas Sat, 27 Jan 2018 22:13:14 -0800

Hi Yacine,
If I'm understanding you correctly, I think what you have in mind is
already implemented in scikit-learn in the TF-IDF vectorizer
<http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>.
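
A minimal sketch of how this might look. The message does not name
specific settings, so it is an assumption here that the relevant ones are
use_idf=False (skip the IDF weighting) and norm="l1" (divide each row by
its sum); the two example documents are made up. It compares that against
dividing the raw CountVectorizer counts by document length directly:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

# Made-up example documents, for illustration only.
docs = ["the cat sat on the mat", "the dog barked"]

# Raw term counts: one row per document, one column per term.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Relative frequency: divide each row by its total count (L1 norm, axis=1).
relative = normalize(counts, norm="l1", axis=1)

# Assumed equivalent: TfidfVectorizer with IDF disabled and L1 row normalization.
tf_only = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs)

# Both approaches should give the same matrix, with every row summing to 1.
same = np.allclose(relative.toarray(), tf_only.toarray())
row_sums = np.asarray(relative.sum(axis=1)).ravel()
```

Note that "document length" here is the number of tokens the vectorizer
keeps, so tokens dropped by the tokenizer or by stop-word filtering are
not counted.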


Best,
   Jake

 Jake VanderPlas
 Senior Data Science Fellow
 Director of Open Software
 University of Washington eScience Institute

On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.maz...@gmail.com> wrote:

> Hello,
>
> I would like to work on adding an additional feature to
> "sklearn.feature_extraction.text.CountVectorizer".
>
> In the current implementation, the definition of term frequency is the
> number of times a term t occurs in document d.
>
> However, another definition that is very commonly used in practice is the term
> frequency adjusted for document length
> <https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e.,
> tf = raw count / document length.
>
> I intend to implement this by adding an additional boolean parameter
> "relative_frequency" to the constructor of CountVectorizer.
> If the parameter is True, normalize X by document length (along axis=1) in
> "CountVectorizer.fit_transform()".
>
> What do you think?
> If this sounds reasonable and worth it, I will send a PR.
>
> Thank you,
> Yacine.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>

