Good point, Joel. I actually forgot that you can set the norm parameter in the TfidfVectorizer, so one could basically do

    vect = TfidfVectorizer(use_idf=False, norm='l1')

to get the CountVectorizer behavior but normalized by document length.

Best,
Sebastian

> On Jan 28, 2018, at 1:29 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> sklearn.preprocessing.Normalizer allows you to normalize any vector by its L1 or L2 norm. L1 would be equivalent to "document length" as long as you did not intend to count stop words in the length. sklearn.feature_extraction.text.TfidfTransformer offers similar norming, but does so only after accounting for the IDF or TF transformation. Since the length normalisation transformation is stateless, it can also be computed with a sklearn.preprocessing.FunctionTransformer.
>
> I can't say it's especially obvious that these features are available, and improvements to the documentation are welcome, but CountVectorizer is complicated enough and we would rather avoid more parameters if we can. I wouldn't hate it if length normalisation were added to TfidfTransformer, if it were shown that normalising before IDF multiplication is more effective than (or complementary to) norming afterwards.
>
> On 28 January 2018 at 18:31, Yacine MAZARI <y.maz...@gmail.com> wrote:
> Hi Jake,
>
> Thanks for the quick reply.
>
> What I meant is different from the TfidfVectorizer. Let me clarify:
>
> In the TfidfVectorizer, the raw counts are multiplied by IDF, which basically means normalizing the counts by document frequencies: tf * idf. But tf is still defined here as the raw count of a term in the document.
>
> What I am suggesting is to add the possibility to use another definition of tf: tf = relative frequency of a term in a document = raw count / document length. On top of this, one could further normalize by IDF to get the TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).
>
> When can this be useful? Here is an example: say term t occurs 5 times in document d1, and also 5 times in document d2.
> At first glance, it seems that the term conveys the same information about both documents. But if we also check document lengths, and find that the length of d1 is 20 whereas the length of d2 is 200, then the "importance" and information carried by the same term in the two documents is probably not the same. If we use relative frequency instead of absolute counts, then tf1 = 5/20 = 0.25, whereas tf2 = 5/200 = 0.025.
>
> There are many practical cases (document similarity, document classification, etc.) where using relative frequencies yields better results, and it might be worth making CountVectorizer support this.
>
> Regards,
> Yacine.
>
> On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas <jake...@cs.washington.edu> wrote:
> Hi Yacine,
> If I'm understanding you correctly, I think what you have in mind is already implemented in scikit-learn in the TF-IDF vectorizer.
>
> Best,
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Open Software
> University of Washington eScience Institute
>
> On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.maz...@gmail.com> wrote:
> Hello,
>
> I would like to work on adding an additional feature to "sklearn.feature_extraction.text.CountVectorizer".
>
> In the current implementation, the definition of term frequency is the number of times a term t occurs in document d.
>
> However, another definition that is very commonly used in practice is the term frequency adjusted for document length, i.e.: tf = raw counts / document length.
>
> I intend to implement this by adding an additional boolean parameter "relative_frequency" to the constructor of CountVectorizer. If the parameter is true, normalize X by document length (along axis=1) in "CountVectorizer.fit_transform()".
>
> What do you think? If this sounds reasonable and worth it, I will send a PR.
>
> Thank you,
> Yacine.
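[Editor's note: a minimal sketch illustrating the equivalence discussed above. The toy documents and variable names are invented for illustration; the sklearn classes and parameters are as named in the thread.]

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Normalizer

docs = [
    "apple banana apple",  # 3 tokens
    "apple cherry",        # 2 tokens
]

# Raw term counts, as CountVectorizer produces them.
counts = CountVectorizer().fit_transform(docs).toarray()

# Relative frequencies by hand: counts divided by document length.
manual = counts / counts.sum(axis=1, keepdims=True)

# Sebastian's suggestion: IDF off + L1 norm gives counts / document length.
rel = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs).toarray()

# Joel's alternative: L1-normalize the count matrix after the fact.
via_normalizer = Normalizer(norm="l1").fit_transform(counts)

print(np.allclose(rel, manual), np.allclose(via_normalizer, manual))  # True True
```

Each row of the L1-normalized matrix sums to 1, i.e. every entry is a relative term frequency in the sense of Yacine's proposal (stop-word removal aside, since removed tokens do not count toward the length).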
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn