Good point, Joel. I actually forgot that you can set the norm parameter in the TfidfVectorizer, so one could basically do

    vect = TfidfVectorizer(use_idf=False, norm='l1')

to get the CountVectorizer behavior but normalized by document length.

Best,
Sebastian

> On Jan 28, 2018, at 1:29 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> sklearn.preprocessing.Normalizer allows you to normalize any vector by its L1 or L2 norm. L1 would be equivalent to "document length" as long as you did not intend to count stop words in the length. sklearn.feature_extraction.text.TfidfTransformer offers similar norming, but does so only after accounting for the IDF or TF transformation. Since the length normalisation transformation is stateless, it can also be computed with a sklearn.preprocessing.FunctionTransformer.
>
> I can't say it's especially obvious that these features are available, and improvements to the documentation are welcome, but CountVectorizer is complicated enough and we would rather avoid more parameters if we can. I wouldn't hate it if length normalisation were added to TfidfTransformer, if it were shown that normalising before IDF multiplication is more effective than (or complementary to) norming afterwards.
>
> On 28 January 2018 at 18:31, Yacine MAZARI <y.maz...@gmail.com> wrote:
> Hi Jake,
>
> Thanks for the quick reply.
>
> What I meant is different from the TfidfVectorizer. Let me clarify:
>
> In the TfidfVectorizer, the raw counts are multiplied by IDF, which basically means normalizing the counts by document frequencies: tf * idf. But tf is still defined here as the raw count of a term in the document.
>
> What I am suggesting is to add the possibility to use another definition of tf: tf = relative frequency of a term in a document = raw count / document length. On top of this, one could further normalize by IDF to get the TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).
>
> When can this be useful? Here is an example: say term t occurs 5 times in document d1, and also 5 times in document d2.
> At first glance, it seems that the term conveys the same information about both documents. But if we also check document lengths, and find that the length of d1 is 20 whereas the length of d2 is 200, then the "importance" and information carried by the same term in the two documents is probably not the same. If we use relative frequency instead of absolute counts, then tf1 = 5/20 = 0.25, whereas tf2 = 5/200 = 0.025.
>
> There are many practical cases (document similarity, document classification, etc.) where using relative frequencies yields better results, and it might be worth making CountVectorizer support this.
>
> Regards,
> Yacine.
>
> On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas <jake...@cs.washington.edu> wrote:
> Hi Yacine,
> If I'm understanding you correctly, I think what you have in mind is already implemented in scikit-learn in the TF-IDF vectorizer.
>
> Best,
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Open Software
> University of Washington eScience Institute
>
> On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.maz...@gmail.com> wrote:
> Hello,
>
> I would like to work on adding an additional feature to "sklearn.feature_extraction.text.CountVectorizer".
>
> In the current implementation, the definition of term frequency is the number of times a term t occurs in document d.
>
> However, another definition that is very commonly used in practice is the term frequency adjusted for document length, i.e.: tf = raw counts / document length.
>
> I intend to implement this by adding an additional boolean parameter "relative_frequency" to the constructor of CountVectorizer. If the parameter is true, normalize X by document length (along axis=1) in "CountVectorizer.fit_transform()".
>
> What do you think? If this sounds reasonable and worth it, I will send a PR.
>
> Thank you,
> Yacine.
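[Editor's note: a minimal sketch illustrating the equivalence discussed above. The toy documents and variable names are invented for illustration; the sklearn classes and parameters are as named in the thread.]

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Normalizer

docs = [
    "apple banana apple",  # 3 tokens
    "apple cherry",        # 2 tokens
]

# Raw term counts, as CountVectorizer produces them.
counts = CountVectorizer().fit_transform(docs).toarray()

# Relative frequencies by hand: counts divided by document length.
manual = counts / counts.sum(axis=1, keepdims=True)

# Sebastian's suggestion: IDF off + L1 norm gives counts / document length.
rel = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs).toarray()

# Joel's alternative: L1-normalize the count matrix after the fact.
via_normalizer = Normalizer(norm="l1").fit_transform(counts)

print(np.allclose(rel, manual), np.allclose(via_normalizer, manual))  # True True
```

Each row of the L1-normalized matrix sums to 1, i.e. every entry is a relative term frequency in the sense of Yacine's proposal (stop-word removal aside, since removed tokens do not count toward the length).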
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn