Okay, thanks for the replies. @Joel: Should I go ahead and send a PR with the change to TfidfTransformer?
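
To make the alternatives concrete, here is a minimal sketch of the two workarounds discussed in the quoted thread below. The toy corpus "docs" is mine, for illustration only; the rest is the existing scikit-learn API:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    docs = ["the cat sat", "the cat sat on the mat"]  # toy corpus

    # Alternative 1: length-normalize the raw counts in a second pass
    # (this is the extra O(N) step discussed below).
    counts_l1 = make_pipeline(CountVectorizer(), Normalizer(norm="l1"))
    X1 = counts_l1.fit_transform(docs)

    # Alternative 2: TfidfVectorizer with IDF disabled yields the same
    # matrix, but it is no longer tf-idf, only normalized term frequencies.
    X2 = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs)

Both produce identical results; the point of the proposal is to get this normalization without the extra pass.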
On Tue, Jan 30, 2018 at 5:27 AM, Joel Nothman <joel.noth...@gmail.com> wrote:

> I don't think you will do this without an O(N) cost. The fact that it's
> done with a second pass is moot.
>
> My position stands: if this change happens, it should be to
> TfidfTransformer (which should perhaps be called something like
> CountVectorWeighter!) alone.
>
> On 30 January 2018 at 02:39, Yacine MAZARI <y.maz...@gmail.com> wrote:
>
>> Hi Folks,
>>
>> Thank you all for the feedback and the interesting discussion.
>>
>> I realize that adding a feature comes with risks, and that there should
>> be compelling reasons to do so.
>>
>> Let me try to address your comments here and make one final case for
>> the value of this feature:
>>
>> 1) Use Normalizer, FunctionTransformer, or custom code to normalize the
>> CountVectorizer result: that would require an additional pass over the
>> data. True, it's "only" O(N), but if there is a way to speed up training
>> an ML model, that would be an advantage.
>>
>> 2) TfidfVectorizer(use_idf=False, norm='l1'): yes, that would have the
>> same effect; but note that this is no longer TF-IDF, since TF-IDF is a
>> two-fold normalization. If one needs TF-IDF (with length-normalized
>> term counts), then two additional passes over the data (with
>> TfidfVectorizer(use_idf=True)) would be required to get the IDF
>> weighting, bringing us back to a case similar to the above.
>>
>> 3) "I wouldn't hate if length normalisation was added to
>> TfidfTransformer, if it was shown that normalising before IDF
>> multiplication was more effective than (or complementary to) norming
>> afterwards."
>> I think this is one of the most important points here. Though not a
>> formal proof, I can for example refer to:
>>
>> - NLTK <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>,
>> which uses document-length-normalized term frequencies.
>>
>> - Manning, Raghavan, and Schütze's Introduction to Information
>> Retrieval
>> <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>:
>> "The same considerations that led us to prefer weighted
>> representations, in particular length-normalized tf-idf
>> representations, in Chapters 6
>> <https://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html#ch:tfidf>
>> and 7
>> <https://nlp.stanford.edu/IR-book/html/htmledition/computing-scores-in-a-complete-search-system-1.html#ch:cosine>
>> also apply here."
>>
>> On the other hand, applying this kind of normalization to a corpus
>> where document lengths are similar (such as tweets) will probably not
>> bring any advantage.
>>
>> 4) As Sebastian mentioned, this would be a handy feature, and the code
>> change would be very small (though, to be fair, any code change brings
>> some risk).
>>
>> What do you think?
>>
>> Best regards,
>> Yacine.
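
P.S. For completeness, the normalize-before-IDF scheme from point 3 can already be expressed as a pipeline. This is only a sketch of the behavior the proposed TfidfTransformer option would expose directly, not the actual change:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    # l1-normalize the counts first (tf = count / document length),
    # then apply IDF weighting and the usual l2 norm on top.
    tf_then_idf = make_pipeline(
        CountVectorizer(),
        Normalizer(norm="l1"),
        TfidfTransformer(use_idf=True, norm="l2"),
    )
    X = tf_then_idf.fit_transform(["the cat sat", "the cat sat on the mat"])

Note that the IDF statistics are unaffected by the l1 step, since document frequencies depend only on which entries are nonzero.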
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn