I don't think you can do this without an O(N) cost. The fact that it's done in a second pass is moot.
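For concreteness, the second pass in question is just an L1 normalisation of the count matrix. A minimal sketch (the corpus and variable names are illustrative only):

    # Length-normalising CountVectorizer output with an explicit second pass.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    corpus = ["the cat sat on the mat", "the dog sat"]

    # Pass 1 builds the raw counts; pass 2 divides each row by its L1 norm,
    # i.e. by the document length, giving relative term frequencies.
    counts_to_tf = make_pipeline(CountVectorizer(), Normalizer(norm="l1"))
    X = counts_to_tf.fit_transform(corpus)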
My position stands: if this change happens, it should be to TfidfTransformer (which should perhaps be called something like CountVectorWeighter!) alone.

On 30 January 2018 at 02:39, Yacine MAZARI <y.maz...@gmail.com> wrote:
> Hi Folks,
>
> Thank you all for the feedback and the interesting discussion.
>
> I do realize that adding a feature comes with risks, and that there should
> really be compelling reasons to do so.
>
> Let me try to address your comments here, and make one final case for the
> value of this feature:
>
> 1) Use Normalizer, FunctionTransformer (or custom code) to normalize the
> CountVectorizer result: that would require an additional pass on the data.
> True, that's "only" O(N), but if there is a way to speed up training an ML
> model, that'd be an advantage.
>
> 2) TfidfVectorizer(use_idf=False, norm='l1'): yes, that would have the
> same effect; but note that this is not TF-IDF any more, in that TF-IDF is
> a two-fold normalization. If one needs TF-IDF (with normalized document
> counts), then two additional passes on the data (with
> TfidfVectorizer(use_idf=True)) would be required to get IDF normalization,
> bringing us to a case similar to the above.
>
> 3)
>> I wouldn't hate it if length normalisation was added to TfidfTransformer,
>> if it was shown that normalising before IDF multiplication was more
>> effective than (or complementary to) norming afterwards.
> I think this is one of the most important points here.
> Though not a formal proof, I can, for example, refer to:
>
> - NLTK <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>,
> which uses document-length-normalized term frequencies.
>
> - Manning, Raghavan, and Schütze's Introduction to Information Retrieval
> <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>:
> "The same considerations that led us to prefer weighted representations,
> in particular length-normalized tf-idf representations, in Chapters 6
> <https://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html#ch:tfidf>
> and 7
> <https://nlp.stanford.edu/IR-book/html/htmledition/computing-scores-in-a-complete-search-system-1.html#ch:cosine>
> also apply here."
>
> On the other hand, applying this kind of normalization to a corpus where
> the document lengths are similar (such as tweets) will probably not be of
> any advantage.
>
> 4) This will be a handy feature, as Sebastian mentioned, and the code
> change will be very small (careful here... any code change brings risks).
>
> What do you think?
>
> Best regards,
> Yacine.
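P.S. For reference, minimal sketches of the two alternatives touched on in points 2) and 3) of the quoted message (corpus illustrative; everything else at defaults):

    from sklearn.feature_extraction.text import (CountVectorizer,
                                                 TfidfTransformer,
                                                 TfidfVectorizer)
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    corpus = ["the cat sat on the mat", "the dog sat"]

    # Point 2: length-normalised term frequencies alone, no IDF weighting.
    tf_only = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(corpus)

    # Point 3: normalise by document length *before* IDF multiplication,
    # the ordering whose merit is under discussion. Note that
    # TfidfTransformer still applies its own norm (L2 by default)
    # *after* the IDF step.
    tf_then_idf = make_pipeline(
        CountVectorizer(),
        Normalizer(norm="l1"),
        TfidfTransformer(),
    ).fit_transform(corpus)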