Hi Folks,

Thank you all for the feedback and interesting discussion.
I do realize that adding a feature comes with risks, and that there should be compelling reasons to do so. Let me try to address your comments here and make one final case for the value of this feature:

1) Use Normalizer, FunctionTransformer (or write custom code) to normalize the CountVectorizer result:
That would require an additional pass over the data. True, that is "only" O(N), but if there is a way to speed up training an ML model, that would be an advantage.

2) TfidfVectorizer(use_idf=False, norm='l1'):
Yes, that would have the same effect, but note that this is not TF-IDF any more, in that TF-IDF is a two-fold normalization. If one needs TF-IDF (with normalized document counts), then 2 additional passes over the data (with TfidfVectorizer(use_idf=True)) would be required to get IDF normalization, bringing us to a case similar to the above.

3)
>> I wouldn't hate if length normalisation was added to TfidfTransformer, if it was shown that
>> normalising before IDF multiplication was more effective than (or complementary to) norming afterwards.

I think this is one of the most important points here. Though not a formal proof, I can for example refer to:
- NLTK <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>, which uses document-length-normalized term frequencies.
- Manning, Raghavan and Schütze's Introduction to Information Retrieval <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>: "The same considerations that led us to prefer weighted representations, in particular length-normalized tf-idf representations, in Chapters 6 <https://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html#ch:tfidf> and 7 <https://nlp.stanford.edu/IR-book/html/htmledition/computing-scores-in-a-complete-search-system-1.html#ch:cosine> also apply here."

On the other hand, applying this kind of normalization to a corpus where the document lengths are similar (such as tweets) will probably not be of any advantage.

4) This will be a handy feature, as Sebastian mentioned, and the code change will be very small (careful here... any code change brings risks).

What do you think?

Best regards,
Yacine.
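
For reference, the options discussed in points 1) to 3) can be sketched as follows. This is only a minimal illustration, not the proposed patch: the toy corpus and all variable names are made up, and it assumes the default tokenization and default smoothed IDF of scikit-learn's existing estimators. It compares counts followed by a separate L1 normalization pass, the single-step TfidfVectorizer(use_idf=False, norm='l1'), and length normalization applied before versus after IDF multiplication.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.preprocessing import normalize

# Toy corpus (made up for illustration) with documents of different lengths.
docs = [
    "the cat sat",
    "the cat sat on the mat with the other cat",
    "dogs chase cats",
]

# Point 1: raw counts, then a separate L1 normalization pass over the rows.
counts = CountVectorizer().fit_transform(docs)
tf_two_steps = normalize(counts, norm="l1", axis=1).toarray()

# Point 2: the same length-normalized term frequencies from one estimator.
tf_one_step = (
    TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs).toarray()
)
same_tf = np.allclose(tf_two_steps, tf_one_step)

# Point 3a: length-normalize first, then multiply by IDF (no final norm).
idf = TfidfTransformer(norm=None).fit(counts).idf_
tf_then_idf = tf_two_steps * idf

# Point 3b: existing behaviour, multiply counts by IDF and L1-normalize after.
idf_then_norm = TfidfTransformer(norm="l1").fit_transform(counts).toarray()
same_weighting = np.allclose(tf_then_idf, idf_then_norm)

print("points 1 and 2 agree:", same_tf)
print("norm before IDF equals norm after IDF:", same_weighting)
```

On this toy corpus the first two variants should agree, while normalizing before the IDF multiplication gives different weights from normalizing afterwards, which is the distinction raised in point 3). Whether one ordering is more effective on real data is not something this sketch can show.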