Hi Yacine,

On 29/01/18 16:39, Yacine MAZARI wrote:
>> I wouldn't hate it if length normalisation was added, if it was shown
>> that normalising before IDF multiplication was more effective than (or
>> complementary to) norming afterwards.
> I think this is one of the most important points here.
> Though not a formal proof, I can for example refer to:
>
>   * NLTK
>     <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>,
>     which uses document-length-normalized term frequencies.
>
>   * Manning, Raghavan and Schütze's Introduction to Information Retrieval
>     <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>:
>     "The same considerations that led us to prefer weighted
>     representations, in particular length-normalized tf-idf
>     representations, in Chapters 6 and 7 also apply here."

I believe the conclusion of Manning's Chapter 6 is the table of tf-idf
weighting schemes at
https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html,
in which document length normalization is applied _after_ the IDF. So
"length-normalized tf-idf" is just TfidfVectorizer with norm='l1', as
previously mentioned (at least, if you measure the document length as the
number of words it contains).

More generally, a weighting & normalization transformer for some of the
other configurations in that table is implemented in

http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html
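
To make that concrete, here is a minimal sketch (the documents are made
up) showing that norm='l1' rescales each tf-idf row after the IDF
multiplication, so every document vector sums to 1 whatever its length:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "the cat sat", "dogs chase cats"]

    # norm='l1' divides each row of the tf-idf matrix by its L1 norm,
    # i.e. the normalization is applied after the IDF weighting.
    vec = TfidfVectorizer(norm="l1")
    X = vec.fit_transform(docs)

    print(X.sum(axis=1))  # every row sums to 1.0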

With respect to the NLTK implementation, see https://github.com/nltk/nltk/pull/979#issuecomment-102296527
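
For reference, the document-length-normalized term frequency discussed
there amounts to something like the following sketch (the function name
and sample tokens are illustrative, not NLTK's actual code):

    def tf(term, document_tokens):
        # Count of the term divided by the total number of tokens in
        # the document, i.e. tf normalized by document length.
        return document_tokens.count(term) / len(document_tokens)

    print(tf("the", ["the", "cat", "sat", "on", "the", "mat"]))  # 2/6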

So I don't think there is a need to change anything in TfidfTransformer...
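
And if someone did want length normalization applied before the IDF
multiplication, that already seems expressible with the existing building
blocks; a rough sketch, again with made-up documents:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    # Divide raw counts by document length (L1 norm) *before* the IDF
    # multiplication, and disable the post-IDF normalization.
    pipe = make_pipeline(
        CountVectorizer(),
        Normalizer(norm="l1"),
        TfidfTransformer(norm=None),
    )
    X = pipe.fit_transform(["the cat sat on the mat", "dogs chase cats"])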

--
Roman