Hi Yacine,

On 29/01/18 16:39, Yacine MAZARI wrote:
>> I wouldn't hate it if length normalisation was added, if it was shown
>> that normalising before IDF multiplication was more effective than (or
>> complementary to) norming afterwards.
> I think this is one of the most important points here.
> Though not a formal proof, I can for example refer to:
>
>   * NLTK
>     <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>,
>     which uses document-length-normalized term frequencies.
>
>   * Manning, Raghavan and Schütze's Introduction to Information Retrieval
>     <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>:
>     "The same considerations that led us to prefer weighted
>     representations, in particular length-normalized tf-idf
>     representations, in Chapters 6 and 7 also apply here."

I believe the conclusion of Manning's Chapter 6 is the table of tf-idf
weighting schemes at
https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html,
in which document length normalization is applied _after_ the IDF. So
"length-normalized tf-idf" is just TfidfVectorizer with norm='l1', as
previously mentioned (at least, if you measure the document length as the
number of words it contains).

More generally, a weighting & normalization transformer for some of the
other configurations in that table is implemented in

http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html
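
To make that concrete, here is a minimal sketch (the documents are made
up) showing that norm='l1' rescales each tf-idf row after the IDF
multiplication, so every document vector sums to 1 whatever its length:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "the cat sat", "dogs chase cats"]

    # norm='l1' divides each row of the tf-idf matrix by its L1 norm,
    # i.e. the normalization is applied after the IDF weighting.
    vec = TfidfVectorizer(norm="l1")
    X = vec.fit_transform(docs)

    print(X.sum(axis=1))  # every row sums to 1.0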

With respect to the NLTK implementation, see https://github.com/nltk/nltk/pull/979#issuecomment-102296527
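
For reference, the document-length-normalized term frequency discussed
there amounts to something like the following sketch (the function name
and sample tokens are illustrative, not NLTK's actual code):

    def tf(term, document_tokens):
        # Count of the term divided by the total number of tokens in
        # the document, i.e. tf normalized by document length.
        return document_tokens.count(term) / len(document_tokens)

    print(tf("the", ["the", "cat", "sat", "on", "the", "mat"]))  # 2/6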

So I don't think there is a need to change anything in TfidfTransformer...
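
And if someone did want length normalization applied before the IDF
multiplication, that already seems expressible with the existing building
blocks; a rough sketch, again with made-up documents:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    # Divide raw counts by document length (L1 norm) *before* the IDF
    # multiplication, and disable the post-IDF normalization.
    pipe = make_pipeline(
        CountVectorizer(),
        Normalizer(norm="l1"),
        TfidfTransformer(norm=None),
    )
    X = pipe.fit_transform(["the cat sat on the mat", "dogs chase cats"])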

--
Roman