That's equivalent to Normalizer(norm='l1'), or, for dense input, to FunctionTransformer(lambda X: X / np.linalg.norm(X, ord=1, axis=1, keepdims=True)).
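For concreteness, a minimal sketch of that equivalence (the toy count matrix is invented for illustration; the FunctionTransformer variant assumes dense input, whereas Normalizer also accepts sparse matrices):

    import numpy as np
    from sklearn.preprocessing import FunctionTransformer, Normalizer

    # Toy document-term counts: 2 documents, 3 terms.
    X = np.array([[3.0, 1.0, 0.0],
                  [2.0, 2.0, 6.0]])

    # Both divide each row by its L1 norm (the document length, for counts).
    l1 = Normalizer(norm='l1').fit_transform(X)
    ft = FunctionTransformer(
        lambda A: A / np.linalg.norm(A, ord=1, axis=1, keepdims=True)
    ).fit_transform(X)

    assert np.allclose(l1, ft)  # e.g. row 0 becomes [0.75, 0.25, 0.0]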
The problem is that length norm followed by TfidfTransformer can't do sublinear TF right: the 1 + log transform would be applied to the already-normalised frequencies rather than to the raw counts. But that's alright if we know we can always do FunctionTransformer(lambda X: calc_sublinear(X) / X.sum(axis=1)) (with calc_sublinear standing in for the sublinear 1 + log transform), perhaps then followed by applying IDF from TfidfTransformer. Yes, it's not straightforward, but it's very hard to provide a library that suits everyone's needs... so FunctionTransformer and Pipeline are your friends :)

On 28 January 2018 at 20:36, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Good point Joel, and I actually forgot that you can set the norm param in
> the TfidfVectorizer, so one could basically do
>
> vect = TfidfVectorizer(use_idf=False, norm='l1')
>
> to have the CountVectorizer behavior but normalizing by the document
> length.
>
> Best,
> Sebastian
>
> On Jan 28, 2018, at 1:29 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
> >
> > sklearn.preprocessing.Normalizer allows you to normalize any vector by
> > its L1 or L2 norm. L1 would be equivalent to "document length" as long
> > as you did not intend to count stop words in the length.
> > sklearn.feature_extraction.text.TfidfTransformer offers similar norming,
> > but does so only after accounting for IDF or TF transformation. Since
> > the length normalisation transformation is stateless, it can also be
> > computed with a sklearn.preprocessing.FunctionTransformer.
> >
> > I can't say it's especially obvious that these features are available,
> > and improvements to the documentation are welcome, but CountVectorizer
> > is complicated enough and we would rather avoid more parameters if we
> > can. I wouldn't hate it if length normalisation were added to
> > TfidfTransformer, if it was shown that normalising before IDF
> > multiplication was more effective than (or complementary to) norming
> > afterwards.
> >
> > On 28 January 2018 at 18:31, Yacine MAZARI <y.maz...@gmail.com> wrote:
> > > Hi Jake,
> > >
> > > Thanks for the quick reply.
> > >
> > > What I meant is different from the TfidfVectorizer. Let me clarify:
> > >
> > > In the TfidfVectorizer, the raw counts are multiplied by IDF, which
> > > basically means normalizing the counts by document frequencies,
> > > tf * idf. But still, tf is defined here as the raw count of a term in
> > > the document.
> > >
> > > What I am suggesting is to add the possibility to use another
> > > definition of tf: tf = relative frequency of a term in a document =
> > > raw count / document length.
> > > On top of this, one could further normalize by IDF to get the TF-IDF
> > > (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).
> > >
> > > When can this be useful? Here is an example:
> > > Say term t occurs 5 times in document d1, and also 5 times in
> > > document d2. At first glance, it seems that the term conveys the same
> > > information about both documents. But if we also check document
> > > lengths, and find that the length of d1 is 20 whereas the length of
> > > d2 is 200, then probably the "importance" and information carried by
> > > the same term in the two documents is not the same.
> > > If we use relative frequency instead of absolute counts, then
> > > tf1 = 5/20 = 0.25 whereas tf2 = 5/200 = 0.025.
> > >
> > > There are many practical cases (document similarity, document
> > > classification, etc.) where using relative frequencies yields better
> > > results, and it might be worth making CountVectorizer support this.
> > >
> > > Regards,
> > > Yacine.
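For what it's worth, Sebastian's vect = TfidfVectorizer(use_idf=False, norm='l1') above reproduces exactly these relative frequencies. A minimal sketch on an invented two-document corpus (note that "document length" here counts only the tokens the vectorizer keeps, so stop-word and token-pattern settings affect it):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["apple apple banana",
            "banana cherry cherry cherry"]
    vect = TfidfVectorizer(use_idf=False, norm='l1')
    X = vect.fit_transform(docs)

    # Columns are sorted alphabetically: apple, banana, cherry.
    # Row 0: [2/3, 1/3, 0]; row 1: [0, 1/4, 3/4], i.e. count / doc length.
    print(X.toarray())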
> > > On Sun, Jan 28, 2018 at 15:12, Jacob Vanderplas
> > > <jake...@cs.washington.edu> wrote:
> > > > Hi Yacine,
> > > > If I'm understanding you correctly, I think what you have in mind
> > > > is already implemented in scikit-learn in the TF-IDF vectorizer.
> > > >
> > > > Best,
> > > > Jake
> > > >
> > > > Jake VanderPlas
> > > > Senior Data Science Fellow
> > > > Director of Open Software
> > > > University of Washington eScience Institute
> > > >
> > > > On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.maz...@gmail.com> wrote:
> > > > > Hello,
> > > > >
> > > > > I would like to work on adding an additional feature to
> > > > > "sklearn.feature_extraction.text.CountVectorizer".
> > > > >
> > > > > In the current implementation, the definition of term frequency
> > > > > is the number of times a term t occurs in document d.
> > > > >
> > > > > However, another definition that is very commonly used in
> > > > > practice is the term frequency adjusted for document length,
> > > > > i.e., tf = raw count / document length.
> > > > >
> > > > > I intend to implement this by adding an additional boolean
> > > > > parameter "relative_frequency" to the constructor of
> > > > > CountVectorizer. If the parameter is true, normalize X by
> > > > > document length (along axis=1) in
> > > > > "CountVectorizer.fit_transform()".
> > > > >
> > > > > What do you think?
> > > > > If this sounds reasonable and worth it, I will send a PR.
> > > > >
> > > > > Thank you,
> > > > > Yacine.
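To tie the pieces together: the behaviour proposed at the start of the thread can already be composed without a new CountVectorizer parameter. A sketch (step names are arbitrary) that computes relative frequencies and then, as discussed above, applies IDF after the length normalisation:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Normalizer

    # counts -> relative frequencies -> IDF weighting
    relfreq_tfidf = Pipeline([
        ('counts', CountVectorizer()),
        ('relfreq', Normalizer(norm='l1')),
        ('idf', TfidfTransformer(use_idf=True, norm=None)),
    ])

    X = relfreq_tfidf.fit_transform(["a toy document",
                                     "another toy document here"])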